EDITOR’S PREFACE


This book is a concise introduction to modern probability 
theory and certain of its ramifications. By deliberate 
succinctness of style and judicious selection of topics, it 
manages to be both fast-moving and self-contained. 

The present edition differs from the Russian original 
(Moscow, 1968) in several respects: 


1. It has been heavily restyled with the addition of 
some new material. Here I have drawn from my own 
background in probability theory, information theory, 
etc.


2. Each of the eight chapters and four appendices has 
been equipped with relevant problems, many accompanied
by hints and answers. There are 150 of these
problems, in large measure drawn from the excellent 
collection edited by A. A. Sveshnikov (Moscow, 1965). 


3. At the end of the book I have added a brief 
Bibliography, containing suggestions for collateral and 
supplementary reading. 

R. A. S.

BASIC CONCEPTS


1. Probability and Relative Frequency


Consider the simple experiment of tossing an unbiased coin. This 
experiment has two mutually exclusive outcomes, namely “heads” and 
“tails.” The various factors influencing the outcome of the experiment are 
too numerous to take into account, at least if the coin tossing is “fair.”
Therefore the outcome of the experiment is said to be “random.” Everyone
would certainly agree that the “probability of getting heads” and the
“probability of getting tails” both equal 1/2. Intuitively, this answer is based on the
idea that the two outcomes are “equally likely” or “equiprobable,” because of
the very nature of the experiment. But hardly anyone will bother at this 
point to clarify just what he means by “probability.” 

Continuing in this vein and taking these ideas at face value, consider an 
experiment with a finite number of mutually exclusive outcomes which are 
equiprobable, i.e., “equally likely because of the nature of the experiment.” 
Let A denote some event associated with the possible outcomes of the 
experiment. Then the probability P(A) of the event A is defined as the fraction 
of the outcomes in which A occurs. More exactly,

$$P(A) = \frac{N(A)}{N}, \tag{1.1}$$

where N is the total number of outcomes of the experiment and N(A) is the
number of outcomes leading to the occurrence of the event A.


Example 1. In tossing a well-balanced coin, there are N = 2 mutually
exclusive equiprobable outcomes (“heads” and “tails”). Let A be either of
these two outcomes. Then N(A) = 1, and hence

$$P(A) = \frac{1}{2}.$$

Example 2. In throwing a single unbiased die, there are N = 6 mutually
exclusive equiprobable outcomes, namely getting a number of spots equal
to each of the numbers 1 through 6. Let A be the event consisting of getting
an even number of spots. Then there are N(A) = 3 outcomes leading to the
occurrence of A (which ones?), and hence

$$P(A) = \frac{3}{6} = \frac{1}{2}.$$

Example 3. In throwing a pair of dice, there are N = 36 mutually 
exclusive equiprobable events, each represented by an ordered pair (a, b),
where a is the number of spots showing on the first die and b the number
showing on the second die. Let A be the event that both dice show the same
number of spots. Then A occurs whenever a = b, i.e., N(A) = 6. Therefore

$$P(A) = \frac{6}{36} = \frac{1}{6}.$$


Remark. Despite its seeming simplicity, formula (1.1) can lead to 
nontrivial calculations. In fact, before using (1.1) in a given problem, we 
must find all the equiprobable outcomes, and then identify all those leading 
to the occurrence of the event A in question. 
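
The two steps named in the Remark (list the equiprobable outcomes, then pick out those favorable to A) can be sketched in a few lines of Python. This is only an illustration of formula (1.1); the helper name `classical_probability` is our own and does not come from the text.

```python
from fractions import Fraction
from itertools import product


def classical_probability(outcomes, event):
    """Return N(A)/N for a finite collection of equiprobable outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for w in outcomes if event(w))
    return Fraction(favorable, len(outcomes))


faces = range(1, 7)

# Example 2: one die, A = "an even number of spots".
p_even = classical_probability(faces, lambda a: a % 2 == 0)

# Example 3: two dice, A = "both dice show the same number of spots".
p_double = classical_probability(product(faces, repeat=2),
                                 lambda w: w[0] == w[1])

print(p_even, p_double)
```

Exact fractions are used so that the results can be compared directly with the values found by hand in Examples 2 and 3.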


The accumulated experience of innumerable observations reveals a 
remarkable regularity of behavior, allowing us to assign a precise meaning 
to the concept of probability not only in the case of experiments with
equiprobable outcomes, but also in the most general case. Suppose the
experiment under consideration can be repeated any number of times, so that, in
principle at least, we can produce a whole series of “independent trials under
identical conditions,”¹ in each of which, depending on chance, a particular
event A of interest either occurs or does not occur. Let n be the total number 
of experiments in the whole series of trials, and let n(A) be the number of 
experiments in which A occurs. Then the ratio 


$$\frac{n(A)}{n}$$

is called the relative frequency of the event A (in the given series of trials). It
turns out that the relative frequencies n(A)/n observed in different series of
trials are virtually the same for large n, clustering about some constant

$$P(A) \sim \frac{n(A)}{n}, \tag{1.2}$$

called the probability of the event A. More exactly, (1.2) means that

$$P(A) = \lim_{n \to \infty} \frac{n(A)}{n}. \tag{1.3}$$

Roughly speaking, the probability P(A) of the event A equals the fraction of
experiments leading to the occurrence of A in a large series of trials.²

1 Concerning the notion of independence, see Sec. 6, in particular footnote 2, p. 31.
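
The clustering of relative frequencies described by (1.2) and (1.3) can be observed in a small simulation. The sketch below assumes that Python's pseudorandom generator is an adequate stand-in for “independent trials under identical conditions”; the function name `relative_frequency` and the fixed seed are our own choices.

```python
import random


def relative_frequency(n, p=0.5, seed=0):
    """Simulate n trials in which A occurs with probability p; return n(A)/n."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    occurrences = sum(1 for _ in range(n) if rng.random() < p)
    return occurrences / n


for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

For small n the printed values may wander noticeably, while for large n they should settle near p.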


Example 4. Table 1 shows the results of a series of 10,000 coin tosses,³
grouped into 100 different series of n = 100 tosses each. In every case, the
table shows the number of tosses n(A) leading to the occurrence of a head.
It is clear that the relative frequency of occurrence of “heads” in each set
of 100 tosses differs only slightly from the probability P(A) = 1/2 found in
Example 1. Note that the relative frequency of occurrence of “heads” is even
closer to 1/2 if we group the tosses in series of 1000 tosses each.


Table 1. Number of heads in a series of coin tosses

Number of heads in 100 series              Number of heads in 10 series
of 100 trials each                         of 1000 trials each⁴

54  46  53  55  46  54  41  48  51  53     501
48  46  40  53  49  49  48  54  53  45     485
43  52  58  51  51  50  52  50  53  49     509
58  60  54  55  50  48  47  57  52  55     536
48  51  51  49  44  52  50  46  53  41     485
49  50  45  52  52  48  47  47  47  51     488
45  47  41  51  49  59  50  55  53  50     500
53  52  46  52  44  51  48  51  46  54     497
45  47  46  52  47  48  59  57  45  48     494
47  41  51  48  59  51  52  55  39  41     484
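
The statement of Example 4 can be checked directly against the entries of Table 1. The sketch below copies the hundred counts from the table, recomputes the right-hand column row by row, and compares how far the relative frequencies stray from 1/2 in series of 100 and of 1000 tosses; the variable names are our own.

```python
# Number of heads in 100 series of 100 tosses each (Table 1, left block).
table = [
    [54, 46, 53, 55, 46, 54, 41, 48, 51, 53],
    [48, 46, 40, 53, 49, 49, 48, 54, 53, 45],
    [43, 52, 58, 51, 51, 50, 52, 50, 53, 49],
    [58, 60, 54, 55, 50, 48, 47, 57, 52, 55],
    [48, 51, 51, 49, 44, 52, 50, 46, 53, 41],
    [49, 50, 45, 52, 52, 48, 47, 47, 47, 51],
    [45, 47, 41, 51, 49, 59, 50, 55, 53, 50],
    [53, 52, 46, 52, 44, 51, 48, 51, 46, 54],
    [45, 47, 46, 52, 47, 48, 59, 57, 45, 48],
    [47, 41, 51, 48, 59, 51, 52, 55, 39, 41],
]

# Right-hand column: heads in 10 series of 1000 tosses, obtained row by row.
row_totals = [sum(row) for row in table]

# Largest deviation of the relative frequency from 1/2 in each grouping.
spread_100 = max(abs(x / 100 - 0.5) for row in table for x in row)
spread_1000 = max(abs(t / 1000 - 0.5) for t in row_totals)

print(row_totals)
print(spread_100, spread_1000)
```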


2 For a more rigorous discussion of the meaning of (1.2) and (1.3), see Sec. 12 on the “law
of large numbers.”

3 Table 1 is taken from W. Feller, An Introduction to Probability Theory and Its
Applications, Volume I, third edition, John Wiley and Sons, Inc., New York (1968), p. 21, and
actually stems from a table of “random numbers.”

4 Obtained by adding the numbers on the left, row by row.

Example 5 (De Méré’s paradox). As a result of extensive observation of
dice games, the French gambler de Méré noticed that the total number of
spots showing on three dice thrown simultaneously turns out to be 11 (the
event $A_1$) more often than it turns out to be 12 (the event $A_2$), although
from his point of view both events should occur equally often. De Méré
reasoned as follows: $A_1$ occurs in just six ways (6:4:1, 6:3:2, 5:5:1, 5:4:2,
5:3:3, 4:4:3), and $A_2$ also occurs in just six ways (6:5:1, 6:4:2, 6:3:3,
5:5:2, 5:4:3, 4:4:4). Therefore $A_1$ and $A_2$ have the same probability
$P(A_1) = P(A_2)$.

The fallacy in this argument was found by Pascal, who showed that the 
outcomes listed by de Méré are not actually equiprobable. In fact, one must 
take account not only of the numbers of spots showing on the dice, but also 
of the particular dice on which the spots appear. For example, numbering 
the dice and writing the number of spots in the corresponding order, we find 
that there are six distinct outcomes leading to the combination 6:4:1, namely 
(6, 4, 1), (6, 1, 4), (4, 6, 1), (4, 1, 6), (1, 6, 4) and (1, 4, 6), whereas there is 
only one outcome leading to the combination 4:4:4, namely (4, 4, 4). The
appropriate equiprobable outcomes are those described by triples of numbers
(a, b, c), where a is the number of spots on the first die, b the number of spots
on the second die, and c the number of spots on the third die. It is easy to
see that there are then precisely $N = 6^3 = 216$ equiprobable outcomes. Of
these, $N(A_1) = 27$ are favorable to the event $A_1$ (in which the sum of all the
spots equals 11), but only $N(A_2) = 25$ are favorable to the event $A_2$ (in which
the sum of all the spots equals 12).⁵ This fact explains the tendency observed
by de Méré for 11 spots to appear more often than 12.
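
Pascal's count can be reproduced by listing all ordered triples (a, b, c). The following sketch enumerates the 216 outcomes and tallies the totals; it also recovers de Méré's six unordered combinations, which shows where his count and Pascal's part ways. The variable names are our own.

```python
from collections import Counter
from itertools import product

triples = list(product(range(1, 7), repeat=3))  # ordered outcomes (a, b, c)
totals = Counter(sum(t) for t in triples)       # outcomes per total of spots

# De Méré's unordered combinations, e.g. 6:4:1, for each total.
combos_11 = {tuple(sorted(t, reverse=True)) for t in triples if sum(t) == 11}
combos_12 = {tuple(sorted(t, reverse=True)) for t in triples if sum(t) == 12}

print(len(triples), totals[11], totals[12])
print(len(combos_11), len(combos_12))
```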


2. Rudiments of Combinatorial Analysis 


Combinatorial formulas are of great use in calculating probabilities. 
We now derive the most important of these formulas. 


THEOREM 1.1. Given $n_1$ elements $a_1, a_2, \ldots, a_{n_1}$ and $n_2$ elements
$b_1, b_2, \ldots, b_{n_2}$, there are precisely $n_1 n_2$ distinct ordered pairs
$(a_i, b_j)$ containing one element of each kind.

Proof. Represent the elements of the first kind by points of the x-axis,
and those of the second kind by points of the y-axis. Then the possible pairs
$(a_i, b_j)$ are points of a rectangular lattice in the xy-plane, as shown in
Figure 1. The fact that there are just $n_1 n_2$ such pairs is obvious from
the figure. ∎⁶

FIGURE 1. [Rectangular lattice of the points $(a_i, b_j)$ in the xy-plane.]

5 To see this, note that a combination a:b:c occurs in 6 distinct ways if a, b and c are
distinct, in 3 distinct ways if two (and only two) of the numbers a, b and c are distinct, and
in only 1 way if a = b = c. Hence $A_1$ occurs in 6 + 6 + 3 + 6 + 3 + 3 = 27 ways, while
$A_2$ occurs in 6 + 6 + 3 + 3 + 6 + 1 = 25 ways.

6 The symbol ∎ stands for Q.E.D. and indicates the end of a proof.




More generally, we have 


THEOREM 1.2. Given $n_1$ elements $a_1, a_2, \ldots, a_{n_1}$, $n_2$ elements $b_1,
b_2, \ldots, b_{n_2}$, etc., up to $n_r$ elements $x_1, x_2, \ldots, x_{n_r}$, there are precisely
$n_1 n_2 \cdots n_r$ distinct ordered r-tuples $(a_{i_1}, b_{i_2}, \ldots, x_{i_r})$ containing one
element of each kind.⁷

Proof. For r = 2, the theorem reduces to Theorem 1.1. Suppose
the theorem holds for r − 1, so that in particular there are precisely
$n_2 \cdots n_r$ (r − 1)-tuples $(b_{i_2}, \ldots, x_{i_r})$ containing one element of each
kind. Then, regarding the (r − 1)-tuples as elements of a new kind, we
note that each r-tuple $(a_{i_1}, b_{i_2}, \ldots, x_{i_r})$ can be regarded as made up of
an (r − 1)-tuple $(b_{i_2}, \ldots, x_{i_r})$ and an element $a_{i_1}$. Hence, by Theorem
1.1, there are precisely

$$n_1 (n_2 \cdots n_r) = n_1 n_2 \cdots n_r$$

r-tuples containing one element of each kind. The theorem now
follows for all r by mathematical induction. ∎
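
Theorem 1.2 is easy to confirm on a small instance by actually forming the r-tuples. In the sketch below the three “kinds” of elements and their labels are made up for the purpose of illustration; `itertools.product` builds every tuple containing one element of each kind.

```python
from itertools import product
from math import prod

# Hypothetical elements of r = 3 kinds, with n_1 = 3, n_2 = 2, n_3 = 4.
kinds = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["x1", "x2", "x3", "x4"],
]

tuples = list(product(*kinds))            # all ordered r-tuples
expected = prod(len(k) for k in kinds)    # n_1 * n_2 * ... * n_r

print(len(tuples), expected)
```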


Example 1. What is the probability of getting three sixes in a throw 
of three dice? 


Solution. Let a be the number of spots on the first die, b the number of
spots on the second die, and c the number of spots on the third die. Then
the result of throwing the dice is described by an ordered triple (a, b, c),
where each element takes values from 1 to 6. Hence, by Theorem 1.2 with
$r = 3$ and $n_1 = n_2 = n_3 = 6$, there are precisely $N = 6^3 = 216$ equiprobable
outcomes of throwing three dice (this fact was anticipated in Example 5,
p. 3). Three sixes can occur in only one way, i.e., when a = b = c = 6.
Therefore the probability of getting three sixes is 1/216.


Example 2 (Sampling with replacement). Suppose we choose r objects
in succession from a “population” (i.e., set) of n distinct objects $a_1, a_2, \ldots, a_n$,
in such a way that after choosing each object and recording the choice, we
return the object to the population before making the next choice. This
gives an “ordered sample” of the form

$$(a_{i_1}, a_{i_2}, \ldots, a_{i_r}). \tag{1.4}$$

Setting $n_1 = n_2 = \cdots = n_r = n$ in Theorem 1.2, we find that there are
precisely

$$N = n^r \tag{1.5}$$

distinct ordered samples of the form (1.4).⁸

7 Two ordered r-tuples $(a_{i_1}, b_{i_2}, \ldots, x_{i_r})$ and $(a_{j_1}, b_{j_2}, \ldots, x_{j_r})$ are said to be distinct
if the elements of at least one pair $a_{i_1}$ and $a_{j_1}$, $b_{i_2}$ and $b_{j_2}$, ..., $x_{i_r}$ and $x_{j_r}$ are distinct.

8 Two “ordered samples” $(a_{i_1}, a_{i_2}, \ldots, a_{i_r})$ and $(a_{j_1}, a_{j_2}, \ldots, a_{j_r})$ are said to be distinct
if $a_{i_k} \ne a_{j_k}$ for at least one $k = 1, 2, \ldots, r$. This is a special case of the definition in
footnote 7.
Example 3 (Sampling without replacement). Next suppose we choose r
objects in succession from a population of n distinct objects $a_1, a_2, \ldots, a_n$,
in such a way that an object once chosen is removed from the population.
Then we again get an ordered sample of the form (1.4), but now there are
$n - 1$ objects left after the first choice, $n - 2$ objects left after the second
choice, and so on. Clearly this corresponds to setting

$$n_1 = n, \quad n_2 = n - 1, \quad \ldots, \quad n_r = n - r + 1$$

in Theorem 1.2. Hence, instead of $n^r$ distinct samples as in the case of
sampling with replacement, there are now only

$$N = n(n - 1) \cdots (n - r + 1) \tag{1.6}$$

distinct samples. If r = n, then (1.6) reduces to

$$N = n(n - 1) \cdots 2 \cdot 1 = n!, \tag{1.7}$$


the total number of permutations of n objects. 
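
Formulas (1.5), (1.6) and (1.7) can be compared with a direct enumeration of ordered samples. In this sketch the population of five labeled objects is hypothetical; `itertools.product` plays the role of sampling with replacement and `itertools.permutations` that of sampling without replacement.

```python
from itertools import permutations, product
from math import factorial, perm

population = ["a1", "a2", "a3", "a4", "a5"]  # n = 5 distinct objects
n, r = len(population), 3

with_replacement = list(product(population, repeat=r))  # should number n**r
without_replacement = list(permutations(population, r)) # n(n-1)...(n-r+1)
all_orderings = list(permutations(population))          # case r = n, i.e. n!

print(len(with_replacement), len(without_replacement), len(all_orderings))
```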


Example 4. Suppose we place r distinguishable objects into n different 
“cells” ($r \le n$), with no cell allowed to contain more than one object.
Numbering both the objects and the cells, let $i_1$ be the number of the cell into which
the first object is placed, $i_2$ the number of the cell into which the second
object is placed, and so on. Then the arrangement of the objects in the cells
is described by an ordered r-tuple $(i_1, i_2, \ldots, i_r)$. Clearly, there are $n_1 = n$
empty cells originally, $n_2 = n - 1$ empty cells after one cell has been occupied,
$n_3 = n - 2$ empty cells after two cells have been occupied, and so on. Hence,
the total number of distinct arrangements of the objects in the cells is again 
given by formula (1.6). 


Example 5. A subway train made up of n cars is boarded by r passengers 
($r \le n$), each entering a car completely at random. What is the probability
of the passengers all ending up in different cars? 


Solution. By hypothesis, every car has the same probability of being 
entered by a given passenger. Numbering both the passengers and the cars, 
let $i_1$ be the number of the car entered by the first passenger, $i_2$ the number
of the car entered by the second passenger, and so on. Then the arrangement
of the passengers in the cars is described by an ordered r-tuple $(i_1, i_2, \ldots, i_r)$,
where each of the numbers $i_1, i_2, \ldots, i_r$ can range from 1 to n. This is
equivalent to sampling with replacement, and hence, by Example 2, there are

$$N = n^r$$

distinct equiprobable arrangements of the passengers in the cars. Let A
be the event that “no more than one passenger enters any car.” Then A
occurs if and only if all the numbers $i_1, i_2, \ldots, i_r$ are distinct. In other
words, if A is to occur, the first passenger can enter one of n cars, but the
second passenger can only enter one of $n - 1$ cars, the third passenger one
of $n - 2$ cars, and so on. This is equivalent to sampling without replacement,
and hence, by Example 3, there are

$$N(A) = n(n - 1) \cdots (n - r + 1)$$

arrangements of passengers in the cars leading to the occurrence of A.
Therefore, by (1.1), the probability of A occurring, i.e., of the passengers all ending
up in different cars, is just

$$P(A) = \frac{n(n - 1) \cdots (n - r + 1)}{n^r}.$$
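
The answer to Example 5 can be cross-checked for small n and r by listing every arrangement of passengers. The sketch below compares the closed form with a brute-force count; both function names are our own, and the brute-force version is only practical for small values.

```python
from fractions import Fraction
from itertools import product
from math import perm


def prob_all_different(n, r):
    """Closed form n(n-1)...(n-r+1) / n**r."""
    return Fraction(perm(n, r), n ** r)


def prob_all_different_bruteforce(n, r):
    """Count r-tuples (i_1, ..., i_r) of car numbers with all entries distinct."""
    arrangements = list(product(range(1, n + 1), repeat=r))
    favorable = sum(1 for t in arrangements if len(set(t)) == r)
    return Fraction(favorable, len(arrangements))


print(prob_all_different(6, 3), prob_all_different_bruteforce(6, 3))
```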


Any set of r elements chosen from a population of n elements, without
regard for order, is called a subpopulation of size r of the original population.
The number of such subpopulations is given by

THEOREM 1.3. A population of n elements has precisely

$$C_r^n = \frac{n!}{r!\,(n - r)!} \tag{1.8}$$

subpopulations of size $r \le n$.

Proof. If order mattered, then the elements of each subpopulation 
could be arranged in r! distinct ways (recall Example 3). Hence there 
are r! times more “ordered samples” of r elements than subpopulations 
of size r. But there are precisely $n(n - 1) \cdots (n - r + 1)$ such ordered
samples (by Example 3 again), and hence just

$$\frac{n(n - 1) \cdots (n - r + 1)}{r!} = \frac{n!}{r!\,(n - r)!}$$

subpopulations of size r. ∎


Remark. An expression of the form (1.8) is called a binomial coefficient,
often denoted by

$$\binom{n}{r}$$

instead of $C_r^n$. The number $C_r^n$ is sometimes called the number of combinations
of n things taken r at a time (without regard for order).
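
The argument in the proof of Theorem 1.3 (each subpopulation corresponds to r! ordered samples) can be verified numerically. In this sketch the population is simply the integers 0, ..., n − 1, and `math.comb` is used as the binomial coefficient.

```python
from itertools import combinations, permutations
from math import comb, factorial

n, r = 6, 3
population = range(n)

subpopulations = list(combinations(population, r))  # order disregarded
ordered_samples = list(permutations(population, r)) # order matters

print(len(subpopulations), comb(n, r))
print(len(ordered_samples), factorial(r) * len(subpopulations))
```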


The natural generalization of Theorem 1.3 is given by 


THEOREM 1.4. Given a population of n elements, let $n_1, n_2, \ldots, n_k$ be
positive integers such that

$$n_1 + n_2 + \cdots + n_k = n.$$

Then there are precisely

$$\frac{n!}{n_1!\, n_2! \cdots n_k!} \tag{1.9}$$

ways of partitioning the population into k subpopulations, of sizes
$n_1, n_2, \ldots, n_k$, respectively.

Proof. The order of the subpopulations matters in the sense that
$n_1 = 2, n_2 = 4, n_3, \ldots, n_k$ and $n_1 = 4, n_2 = 2, n_3, \ldots, n_k$ (say)
represent different partitions, but the order of elements within the
subpopulations themselves is irrelevant. The partitioning can be effected
in stages, as follows: First we form a group of $n_1$ elements from
the original population. This can be done in

$$N_1 = C_{n_1}^{n}$$

ways. Then we form a group of $n_2$ elements from the remaining $n - n_1$
elements. This can be done in

$$N_2 = C_{n_2}^{n - n_1}$$

ways. Proceeding in this fashion, we are left with $n - n_1 - \cdots - n_{k-2} = n_{k-1} + n_k$
elements after $k - 2$ stages. These elements can be partitioned
into two groups, one containing $n_{k-1}$ elements and the other $n_k$ elements,
in

$$N_{k-1} = C_{n_{k-1}}^{n - n_1 - \cdots - n_{k-2}}$$

ways. Hence, by Theorem 1.2, there are

$$N = N_1 N_2 \cdots N_{k-1} = C_{n_1}^{n}\, C_{n_2}^{n - n_1} \cdots C_{n_{k-1}}^{n - n_1 - \cdots - n_{k-2}}$$

distinct ways of partitioning the given population into the indicated k
subpopulations. But

$$
\begin{aligned}
C_{n_1}^{n}\, C_{n_2}^{n - n_1} \cdots C_{n_{k-1}}^{n - n_1 - \cdots - n_{k-2}}
&= \frac{n!}{n_1!\,(n - n_1)!}\, \frac{(n - n_1)!}{n_2!\,(n - n_1 - n_2)!} \cdots \frac{(n - n_1 - \cdots - n_{k-2})!}{n_{k-1}!\,(n - n_1 - \cdots - n_{k-2} - n_{k-1})!} \\
&= \frac{n!}{n_1!\,(n - n_1)!}\, \frac{(n - n_1)!}{n_2!\,(n - n_1 - n_2)!} \cdots \frac{(n - n_1 - \cdots - n_{k-2})!}{n_{k-1}!\, n_k!}
 = \frac{n!}{n_1!\, n_2! \cdots n_k!},
\end{aligned}
$$

in keeping with (1.9). ∎

Remark. Theorem 1.4 reduces to Theorem 1.3 if

$$k = 2, \quad n_1 = r, \quad n_2 = n - r.$$

The numbers (1.9) are called multinomial coefficients, and generalize the
binomial coefficients (1.8). 
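
The stage-by-stage construction used in the proof of Theorem 1.4 translates directly into a short computation. The sketch below evaluates (1.9) in two ways, once from the factorial formula and once as a product of binomial coefficients; the function names are our own, and the final stage (choosing all remaining elements) contributes a factor of 1.

```python
from math import comb, factorial, prod


def multinomial(sizes):
    """n! / (n_1! n_2! ... n_k!) with n = n_1 + ... + n_k, as in (1.9)."""
    return factorial(sum(sizes)) // prod(factorial(m) for m in sizes)


def multinomial_by_stages(sizes):
    """Choose the subpopulations one after another, as in the proof."""
    remaining = sum(sizes)
    ways = 1
    for m in sizes:
        ways *= comb(remaining, m)  # choose m of the elements still left
        remaining -= m
    return ways


print(multinomial([2, 4, 3]), multinomial_by_stages([2, 4, 3]))
```

With two subpopulations of sizes r and n − r, both functions reduce to the binomial coefficient of Theorem 1.3.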


Example 6 (Quality control). A batch of 100 manufactured items is 
checked by an inspector, who examines 10 items selected at random. If 
none of the 10 items is defective, he accepts the whole batch. Otherwise, the 
batch is subjected to further inspection. What is the probability that a batch 
containing 10 defective items will be accepted? 


Solution. The number of ways of selecting 10 items out of a batch of 
100 items equals the number of combinations of 100 things taken 10 at a time, 
and is just 
$$N = C_{10}^{100} = \frac{100!}{10!\,90!}.$$


By hypothesis, these combinations are all equiprobable (the items being 
selected “‘at random”). Let A be the event that “the batch of items is accepted 
by the inspector.” Then A occurs whenever all 10 items belong to the set of 
90 items of acceptable quality. Hence the number of combinations favorable 
to A is

$$N(A) = C_{10}^{90} = \frac{90!}{10!\,80!}.$$


It follows from (1.1) that the probability of the event A, i.e., of the batch 
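being accepted, equals⁹

$$P(A) = \frac{N(A)}{N} = \frac{90!\,90!}{80!\,100!} = \frac{81 \cdot 82 \cdots 90}{91 \cdot 92 \cdots 100} \approx \left(1 - \frac{1}{10}\right)^{10} \approx \frac{1}{e},$$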


where e = 2.718... is the base of the natural logarithms. 
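
To see how rough the approximations in Example 6 are, the exact ratio of binomial coefficients can be evaluated and set beside $(1 - 1/10)^{10}$ and $1/e$. The function name and its parameters below are our own; the numbers 100, 10 and 10 are those of the example.

```python
from fractions import Fraction
from math import comb, e


def acceptance_probability(batch, defective, sample):
    """Probability that a random sample contains no defective item."""
    good = batch - defective
    return Fraction(comb(good, sample), comb(batch, sample))


p_exact = acceptance_probability(100, 10, 10)

print(float(p_exact))  # exact value N(A)/N as a decimal
print(0.9 ** 10)       # first approximation
print(1 / e)           # second approximation
```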


Example 7. What is the probability that two playing cards picked at 
random from a full deck are both aces? 


Solution. A full deck consists of 52 cards, of which 4 are aces. There are 


$$C_2^{52} = \frac{52!}{2!\,50!} = 1326$$

ways of selecting a pair of cards from the deck. Of these 1326 pairs, there are

$$C_2^4 = \frac{4!}{2!\,2!} = 6$$

consisting of two aces. Hence the probability of picking two aces is just

$$\frac{C_2^4}{C_2^{52}} = \frac{6}{1326} = \frac{1}{221}.$$

9 The symbol ≈ means “is approximately equal to.”


Example 8. What is the probability that each of four bridge players 
holds an ace? 


Solution. Applying Theorem 1.4 with $n = 52$ and $n_1 = n_2 = n_3 = n_4 = 13$,
we find that there are

$$\frac{52!}{13!\,13!\,13!\,13!}$$

distinct deals of bridge. There are 4! = 24 ways of giving an ace to each
player, and then the remaining 48 cards can be dealt out in

$$\frac{48!}{12!\,12!\,12!\,12!}$$

distinct ways. Hence there are

$$4!\,\frac{48!}{(12!)^4}$$

distinct deals of bridge such that each player receives an ace. Therefore the
probability of each player receiving an ace is just

$$4!\,\frac{48!}{(12!)^4}\,\frac{(13!)^4}{52!} = \frac{24\,(13)^4}{52 \cdot 51 \cdot 50 \cdot 49} \approx 0.105.$$

Remark. Most of the above formulas contain the quantity

$$n! = n(n - 1) \cdots 2 \cdot 1,$$

called n factorial. For large n, it can be shown that¹⁰

$$n! \sim \sqrt{2\pi n}\; n^n e^{-n}.$$

This simple asymptotic representation of n! is known as Stirling's formula.¹¹
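
Both the bridge probability of Example 8 and the quality of Stirling's formula can be examined numerically. The sketch below evaluates the exact probability as a fraction and tabulates the ratio of n! to its Stirling approximation for a few values of n chosen by us; the ratio should approach 1 as n grows.

```python
from fractions import Fraction
from math import e, factorial, pi, sqrt


def stirling(n):
    """Stirling's approximation sqrt(2 pi n) n^n e^(-n) to n!."""
    return sqrt(2 * pi * n) * n ** n * e ** (-n)


# Ratio n! / stirling(n) for several n.
ratios = {n: factorial(n) / stirling(n) for n in (1, 5, 10, 50, 100)}

# Example 8: probability that each of four bridge players holds an ace.
p_aces = Fraction(24 * factorial(48) * factorial(13) ** 4,
                  factorial(12) ** 4 * factorial(52))

for n, ratio in ratios.items():
    print(n, ratio)
print(p_aces, float(p_aces))
```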


PROBLEMS 


1. A four-volume work is placed in random order on a bookshelf. What is the
probability of the volumes being in proper order from left to right or from 
right to left? 


10 The symbol ~ between two variables $\alpha_n$ and $\beta_n$ means that the ratio $\alpha_n / \beta_n \to 1$ as
$n \to \infty$.

11 Proved, for example, in D. V. Widder, Advanced Calculus, second edition,
Prentice-Hall, Inc., Englewood Cliffs, N.J. (1961), p. 386.

2. A wooden cube with painted faces is sawed up into 1000 little cubes, all of the
same size. The little cubes are then mixed up, and one is chosen at random.
What is the probability of its having just 2 painted faces?

Ans. 0.096.


3. A batch of n manufactured items contains k defective items. Suppose m 
items are selected at random from the batch. What is the probability that l of
these items are defective? 


4. Ten books are placed in random order on a bookshelf. Find the probability 
of three given books being side by side.
Ans. 1/15.


5. One marksman has an 80% probability of hitting a target, while another has
only a 70% probability of hitting the target. What is the probability of the
target being hit (at least once) if both marksmen fire at it simultaneously?
Ans. 0.94.


6. Suppose n people sit down at random and independently of each other in an
auditorium containing n + k seats. What is the probability that m seats specified
in advance ($m \le n$) will be occupied?


7. Three cards are drawn at random from a full deck. What is the probability 
of getting a three, a seven and an ace? 


8. What is the probability of being able to form a triangle from three segments 
chosen at random from five line segments of lengths 1, 3, 5, 7 and 9? 

Hint. A triangle cannot be formed if one segment is longer than the sum of 
the other two. 


9. Suppose a number from 1 to 1000 is selected at random. What is the proba- 
bility that the last two digits of its cube are both 1? 

Hint. There is no need to look through a table of cubes.

Ans. 0.01. 


10. Find the probability that a randomly selected positive integer will give a 
number ending in 1 if it is 

a) Squared; 

b) Raised to the fourth power; 

c) Multiplied by an arbitrary positive integer. 

Hint. It is enough to consider one-digit numbers. 

Ans. a) 0.2; b) 0.4; c) 0.04. 


11. One of the numbers 2, 4, 6, 7, 8, 11, 12 and 13 is chosen at random as the 
numerator of a fraction, and then one of the remaining numbers is chosen at 
random as the denominator of the fraction. What is the probability of the 
fraction being in lowest terms? 


12. The word “drawer” is spelled with six scrabble tiles. The tiles are then 
randomly rearranged. What is the probability of the rearranged tiles spelling 
the word “reward”?

Ans. 1/360.

13. In throwing 6n dice, what is the probability of getting each face n times?
Use Stirling's formula to estimate this probability for large n. 


14. A full deck of cards is divided in half at random. Use Stirling’s formula to
estimate the probability that each half contains the same number of red and
black cards.

Ans. $$\frac{(C_{13}^{26})^2}{C_{26}^{52}} \approx \frac{2}{\sqrt{26\pi}} \approx 0.22.$$

15. Use Stirling’s formula to estimate the probability that all 50 states are
represented in a committee of 50 senators chosen at random.


16. Suppose 2n customers stand in line at a box office, n with 5-dollar bills and
n with 10-dollar bills. Suppose each ticket costs 5 dollars, and the box office
has no money initially. What is the probability that none of the customers has
to wait for change?¹²


17. Prove that

$$\sum_{k=0}^{n} (C_k^n)^2 = C_n^{2n}.$$

Hint. Use the binomial theorem to calculate the coefficient of $x^n$ in the
product $(1 + x)^n (1 + x)^n = (1 + x)^{2n}$.


12 A detailed solution is given in B. V. Gnedenko, The Theory of Probability, fourth
edition (translated by B. D. Seckler), Chelsea Publishing Co., New York (1967), p. 43.

COMBINATION OF EVENTS


3. Elementary Events. The Sample Space 


The mutually exclusive outcomes of a random experiment (like throwing
a pair of dice) will be called elementary events (or sample points), and a
typical elementary event will be denoted by the Greek letter $\omega$. The set of
all elementary events $\omega$ associated with a given experiment will be called the
sample space (or space of elementary events), denoted by the Greek letter
$\Omega$. An event A is said to be “associated with the elementary events of $\Omega$” if,
given any $\omega$ in $\Omega$, we can always decide whether or not $\omega$ leads to the
occurrence of A. The same symbol A will be used to denote both the event A and
the set of elementary events leading to the occurrence of A. Clearly, an event
A occurs if and only if one of the elementary events $\omega$ in the set A occurs.
Thus, instead of talking about the occurrence of the original event A, we can
just as well talk about the “occurrence of an elementary event in the set
A.” From now on, we will not distinguish between an event associated with
a given experiment and the corresponding set of elementary events, it being
understood that all our events are of the type described by saying “one of the
elementary events in the set A occurs.” With this interpretation, events are
nothing more or less than subsets of some underlying sample space $\Omega$. Thus
the certain (or sure) event, which always occurs regardless of the outcome
of the experiment, is formally identical with the whole space $\Omega$, while the
impossible event is just the empty set $\emptyset$, containing none of the elementary
events $\omega$.

Given two events $A_1$ and $A_2$, suppose $A_1$ occurs if and only if $A_2$ occurs.
Then $A_1$ and $A_2$ are said to be identical (or equivalent), and we write $A_1 = A_2$.

Example 1. In throwing a pair of dice, let $A_1$ be the event that “the
total number of spots is even” and $A_2$ the event that “both dice turn up even
or both dice turn up odd.”¹ Then $A_1 = A_2$.


Example 2. In throwing three dice, let $A_1$ again be the event that “the
total number of spots is even” and $A_2$ the event that “all three dice have either
an even number of spots or an odd number of spots.” Then $A_1 \ne A_2$.
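
Since events are just sets of elementary events, Examples 1 and 2 can be checked by building the sets explicitly. In the sketch below the sample space consists of tuples of spot counts, one entry per die; the function names are our own.

```python
from itertools import product


def sample_space(dice):
    """All elementary events for a throw of the given number of dice."""
    return set(product(range(1, 7), repeat=dice))


def total_even(omega):
    """Event: the total number of spots is even."""
    return {w for w in omega if sum(w) % 2 == 0}


def same_parity(omega):
    """Event: all dice turn up even or all dice turn up odd."""
    return {w for w in omega if len({x % 2 for x in w}) == 1}


two, three = sample_space(2), sample_space(3)
print(total_even(two) == same_parity(two))      # Example 1
print(total_even(three) == same_parity(three))  # Example 2
```

For three dice the outcome (1, 1, 2) has an even total although the dice do not all share the same parity, which is one way to see why the two events differ.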


Two events $A_1$ and $A_2$ are said to be mutually exclusive or incompatible if
the occurrence of one event precludes the occurrence of the other, i.e., if $A_1$
and $A_2$ cannot occur simultaneously.

By the union of two events $A_1$ and $A_2$, denoted by $A_1 \cup A_2$, we mean the
event consisting of the occurrence of at least one of the events $A_1$ and $A_2$.
The union of several events $A_1, A_2, \ldots$ is defined in the same way, and is
denoted by $\bigcup_k A_k$.

By the intersection of two events $A_1$ and $A_2$, denoted by $A_1 \cap A_2$ or simply
by $A_1 A_2$, we mean the event consisting of the occurrence of both events $A_1$
and $A_2$. By the intersection of several events $A_1, A_2, \ldots$, denoted by $\bigcap_k A_k$,
we mean the event consisting of the occurrence of all the events $A_1, A_2, \ldots$.

Given two events $A_1$ and $A_2$, by the difference $A_1 - A_2$ we mean the event
in which $A_1$ occurs but not $A_2$. By the complementary event of an event A,²
denoted by $\bar{A}$, we mean the event “A does not occur.” Clearly,

$$\bar{A} = \Omega - A.$$
Example 3. In throwing a pair of dice, let A be the event that “the
total number of spots is even,” $A_1$ the event that “both dice turn up even,”
and $A_2$ the event that “both dice turn up odd.” Then $A_1$ and $A_2$ are mutually
exclusive, and clearly

$$A = A_1 \cup A_2.$$

Let $\bar{A}$, $\bar{A}_1$ and $\bar{A}_2$ be the events complementary to A, $A_1$ and $A_2$, respectively.
Then $\bar{A}$ is the event that “the total number of spots is odd,” $\bar{A}_1$ the event that
“at least one die turns up odd,” and $\bar{A}_2$ the event that “at least one die turns
up even.” It is easy to see that

$$\bar{A}_1 - \bar{A} = \bar{A}_1 \cap A = A_2, \qquad \bar{A}_2 - \bar{A} = \bar{A}_2 \cap A = A_1.$$


The meaning of concepts like the union of two events, the intersection
of two events, etc., is particularly clear if we think of events as sets of
elementary events $\omega$, in the way described above. With this interpretation,
given events $A_1$, $A_2$ and A, $A_1 \cup A_2$ is the union of the sets $A_1$ and $A_2$,
$A_1 \cap A_2$ is the intersection of the sets $A_1$ and $A_2$, $\bar{A} = \Omega - A$ is the
complement of the set A relative to the whole space $\Omega$, and so on. Thus the symbols
$\cup$, $\cap$, etc. have their customary set-theoretic meaning. Moreover, the
statement that “the occurrence of the event $A_1$ implies that of the event $A_2$”
(or simply, “$A_1$ implies $A_2$”) means that $A_1 \subset A_2$, i.e., that the set $A_1$ is a
subset of the set $A_2$.³

1 To “turn up even” means to show an even number of spots, and similarly for to “turn
up odd.”

2 Synonymously, the “complement of A” or the “event complementary to A.”


FIGURE 2. (a) The events $A_1$ and $A_2$ are mutually exclusive;
(b) The unshaded figure represents the union $A_1 \cup A_2$; (c) The
unshaded figure represents the intersection $A_1 \cap A_2$; (d) The
unshaded figure represents the difference $A_1 - A_2$; (e) The shaded
and unshaded events ($A_1$ and $A_2$) are complements of each other;
(f) Event $A_1$ implies event $A_2$.


To visualize relations between events, it is convenient to represent the 
sample space $\Omega$ schematically by some plane region and the elementary
events $\omega$ by points in this region. Then events, i.e., sets of points $\omega$, become
various plane figures. Thus Figure 2 shows various relations between two
events $A_1$ and $A_2$, represented by circular disks lying inside a rectangle $\Omega$,
schematically representing the whole sample space. In turn, this way of
representing events in terms of plane figures can be used to deduce general
relations between events, e.g.,

a) If $A_1 \subset A_2$, then $\bar{A}_1 \supset \bar{A}_2$;
b) If $A = A_1 \cup A_2$, then $\bar{A} = \bar{A}_1 \cap \bar{A}_2$;
c) If $A = A_1 \cap A_2$, then $\bar{A} = \bar{A}_1 \cup \bar{A}_2$.

3 The symbol $\subset$ means “is a subset of” or “is contained in,” while $\supset$ means “contains.”
Quite generally, given a relation between various events, we can get an
equivalent relation by changing events to their complements and the symbols
$\cap$, $\cup$ and $\subset$ to $\cup$, $\cap$ and $\supset$ (the sign = is left alone).
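
The rule for passing to complements can be tried out on randomly chosen subsets of a small sample space. The sketch below is only a spot check, not a proof; the sample space of 20 points, the seed, and the function names are arbitrary choices of ours.

```python
import random

omega = set(range(20))  # a small stand-in for the sample space
rng = random.Random(1)


def complement(event):
    """Complement of an event relative to the whole space omega."""
    return omega - event


def random_event():
    """A random subset of omega."""
    return {w for w in omega if rng.random() < 0.5}


def duality_holds(a1, a2):
    """Check relations a), b) and c) for the events a1 and a2."""
    smaller = a1 & a2  # guaranteed to be a subset of a1
    subset_reversed = complement(smaller) >= complement(a1)
    union_rule = complement(a1 | a2) == complement(a1) & complement(a2)
    intersection_rule = complement(a1 & a2) == complement(a1) | complement(a2)
    return subset_reversed and union_rule and intersection_rule


checks = [duality_holds(random_event(), random_event()) for _ in range(1000)]
print(all(checks))
```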


Example 4. The following relations are equivalent:
UES lakes, 
°k k 
NA,=B>NC,, 
k k 
UA,=B- UC,. 
k k 


Remark. It will henceforth be assumed that all events under consideration
have well-defined probabilities. Moreover, it will be assumed that all
events obtained from a given sequence of events $A_1, A_2, \ldots$ by taking unions,
intersections, differences and complements also have well-defined probabilities. 


4. The Addition Law for Probabilities 


Consider two mutually exclusive events A, and A, associated with the 
outcomes of some random experiment, and let A = A, U A, be the union of 
the two events. Suppose we repeat the experiment a large number of times, 
thereby producing a whole series of “independent trials under identical 
conditions.’ Let n be the total number of trials, and let n(A,), m(A.) and 
n(A) be the numbers of trials leading to the events A,, A, and A, respectively. 
If A occurs in a trial, then either A, occurs or A, occurs, but not both (since 
A, and A, are mutually exclusive). Therefore 


n(A) = n(A) + n(A,), 
and hence 


n(A) _ n(Ai) 


n n n 


But for sufficiently large n, the relative frequencies n(A)/n, m(A,)/n and 
n(A,)/n virtually coincide with the corresponding probabilities P(A), P(A,) 
and P(Aq), as discussed on p. 3. It follows that 


P(A) = P(A,) + P(A,). (2.1) 


Similarly, if the events A,, A, and A, are mutually exclusive, then so are 
A, U A, and Ag, and hence, by two applications of (2.1), 


P(A, U A, U As) = P(Ay U Ag) + P(As) = P(A,) + P(Ae) + P(As)- 
More generally, given $n$ mutually exclusive events $A_1, A_2, \ldots, A_n$, we have
the formula

$$P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr) = \sum_{k=1}^{n} P(A_k), \tag{2.2}$$

obtained by applying (2.1) $n - 1$ times. Equation (2.2) is called the addition
law for probabilities.
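The frequency argument behind (2.1) can be tried numerically. The following is only an illustrative sketch: the experiment (rolling a fair die), the events $A_1$ = “the die shows 1” and $A_2$ = “the die shows 2 or 3”, and the function name are assumptions chosen for the example, not taken from the text.

```python
import random

def relative_frequencies(n_trials, seed=0):
    """Count occurrences of A1, A2 and A = A1 ∪ A2 in repeated die rolls."""
    rng = random.Random(seed)
    n_a1 = n_a2 = n_a = 0
    for _ in range(n_trials):
        face = rng.randint(1, 6)
        in_a1 = face == 1            # A1: the die shows 1
        in_a2 = face in (2, 3)       # A2: the die shows 2 or 3
        n_a1 += in_a1
        n_a2 += in_a2
        n_a += in_a1 or in_a2        # A occurs when A1 or A2 occurs
    return n_a1, n_a2, n_a

n = 100_000
n_a1, n_a2, n_a = relative_frequencies(n)
# The counts add exactly because A1 and A2 are mutually exclusive.
print(n_a == n_a1 + n_a2)
# Both relative frequencies should be close to P(A) = 1/2.
print(n_a / n, n_a1 / n + n_a2 / n)
```

The counts satisfy $n(A) = n(A_1) + n(A_2)$ exactly in every run; only the closeness of $n(A)/n$ to $P(A)$ depends on $n$ being large.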


Next we prove some key relations involving probabilities: 


THEOREM 2.1. The formulas

$$0 \le P(A) \le 1,^{4} \tag{2.3}$$
$$P(A_1 - A_2) = P(A_1) - P(A_1 \cap A_2), \tag{2.4}$$
$$P(A_2 - A_1) = P(A_2) - P(A_1 \cap A_2), \tag{2.5}$$
$$P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) \tag{2.6}$$

hold for arbitrary events $A$, $A_1$ and $A_2$. Moreover,

$$P(A_1) \le P(A_2) \quad \text{if } A_1 \subset A_2. \tag{2.7}$$


Proof. Formula (2.3) follows at once from the interpretation of
probability as the limiting value of relative frequency, since obviously

$$0 \le \frac{n(A)}{n} \le 1,$$

where $n(A)$ is the number of occurrences of an event $A$ in $n$ trials. Given
any two events $A_1$ and $A_2$, we have

$$A_1 = (A_1 - A_2) \cup (A_1 \cap A_2),$$
$$A_2 = (A_2 - A_1) \cup (A_1 \cap A_2),$$
$$A_1 \cup A_2 = (A_1 - A_2) \cup (A_2 - A_1) \cup (A_1 \cap A_2),$$

where the events $A_1 - A_2$, $A_2 - A_1$ and $A_1 \cap A_2$ are mutually exclusive.
Therefore, by (2.2),

$$P(A_1) = P(A_1 - A_2) + P(A_1 \cap A_2), \tag{2.8}$$
$$P(A_2) = P(A_2 - A_1) + P(A_1 \cap A_2), \tag{2.9}$$
$$P(A_1 \cup A_2) = P(A_1 - A_2) + P(A_2 - A_1) + P(A_1 \cap A_2). \tag{2.10}$$

Formulas (2.8) and (2.9) are equivalent to (2.4) and (2.5). Then, using
(2.4) and (2.5), we can write (2.10) in the form

$$P(A_1 \cup A_2) = P(A_1) - P(A_1 \cap A_2) + P(A_2) - P(A_1 \cap A_2) + P(A_1 \cap A_2)
= P(A_1) + P(A_2) - P(A_1 \cap A_2),$$

4 Note that $P(\emptyset) = 0$, $P(\Omega) = 1$, since $n(\emptyset) = 0$, $n(\Omega) = n$ for all $n$. Thus the
impossible event has probability zero, while the certain event has probability one.
thereby proving (2.6). Finally, to prove (2.7), we note that if $A_1 \subset A_2$,
then $A_1 \cap A_2 = A_1$, and hence (2.9) implies

$$P(A_1) = P(A_2) - P(A_2 - A_1) \le P(A_2),$$

since $P(A_2 - A_1) \ge 0$ by (2.3). ∎
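The relations (2.4)-(2.7) are easy to check on a small finite sample space with equiprobable outcomes. The sketch below assumes a fair die and two arbitrarily chosen events; the names `OMEGA`, `prob`, `A1` and `A2` are illustrative only.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))       # assumed sample space: faces of a fair die

def prob(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A1 = frozenset({1, 2, 3})
A2 = frozenset({2, 3, 4, 5})

assert prob(A1 - A2) == prob(A1) - prob(A1 & A2)               # (2.4)
assert prob(A2 - A1) == prob(A2) - prob(A1 & A2)               # (2.5)
assert prob(A1 | A2) == prob(A1) + prob(A2) - prob(A1 & A2)    # (2.6)
assert prob(frozenset({2, 3})) <= prob(A1)                     # (2.7), since {2, 3} ⊂ A1
print(prob(A1 | A2))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the identities hold exactly rather than up to rounding.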


The addition law (2.2) becomes much more complicated if we drop the 
requirement that the events be mutually exclusive: 


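THEOREM 2.2. Given any $n$ events $A_1, A_2, \ldots, A_n$, let⁵

$$P_1 = \sum_{i=1}^{n} P(A_i),$$
$$P_2 = \sum_{1 \le i < j \le n} P(A_i A_j),$$
$$P_3 = \sum_{1 \le i < j < k \le n} P(A_i A_j A_k), \ \ldots$$

Then

$$P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr) = P_1 - P_2 + P_3 - P_4 + \cdots \pm P_n. \tag{2.11}$$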


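Proof. For $n = 2$, (2.11) reduces to formula (2.6), which we have
already proved. Suppose (2.11) holds for any $n - 1$ events. Then

$$P\Bigl(\bigcup_{k=2}^{n} A_k\Bigr) = \sum_{i=2}^{n} P(A_i) - \sum_{2 \le i < j \le n} P(A_i A_j)
+ \sum_{2 \le i < j < k \le n} P(A_i A_j A_k) - \cdots \tag{2.12}$$

and

$$P\Bigl(\bigcup_{k=2}^{n} A_1 A_k\Bigr) = \sum_{i=2}^{n} P(A_1 A_i) - \sum_{2 \le i < j \le n} P(A_1 A_i A_j)
+ \sum_{2 \le i < j < k \le n} P(A_1 A_i A_j A_k) - \cdots. \tag{2.13}$$

But, by (2.6),

$$P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr) = P(A_1) + P\Bigl(\bigcup_{k=2}^{n} A_k\Bigr) - P\Bigl(\bigcup_{k=2}^{n} A_1 A_k\Bigr),$$

and hence, by (2.12) and (2.13),

$$P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr) = P(A_1) + \sum_{i=2}^{n} P(A_i) - \sum_{2 \le i < j \le n} P(A_i A_j)
+ \sum_{2 \le i < j < k \le n} P(A_i A_j A_k) - \cdots
- \sum_{i=2}^{n} P(A_1 A_i) + \sum_{2 \le i < j \le n} P(A_1 A_i A_j) - \cdots
= P_1 - P_2 + P_3 - \cdots,$$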

5 $A_i A_j$ is shorthand for the intersection $A_i \cap A_j$, $A_i A_j A_k$ is shorthand for $A_i \cap A_j \cap A_k$,
and so on. In a sum like $\sum_{1 \le i < j < k \le n} P(A_i A_j A_k)$, each group of indices (satisfying the indicated
inequalities) is encountered just once.
i.e., (2.11) holds for any $n$ events. The proof for all $n$ now follows by
mathematical induction. ∎
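Formula (2.11) can be checked by brute force on a finite sample space. The sketch below is illustrative only: the sample space of 12 equiprobable points, the four events, and the helper names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, omega):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events, omega):
    """Right-hand side of (2.11): P_1 - P_2 + P_3 - ... ± P_n."""
    total = Fraction(0)
    for m in range(1, len(events) + 1):
        # P_m: sum of the probabilities of all m-fold intersections
        p_m = sum(
            (prob(frozenset.intersection(*group), omega)
             for group in combinations(events, m)),
            Fraction(0),
        )
        total += p_m if m % 2 == 1 else -p_m
    return total

omega = frozenset(range(12))
events = [
    frozenset({0, 1, 2, 3}),
    frozenset({2, 3, 4, 5, 6}),
    frozenset({0, 6, 7}),
    frozenset({3, 6, 9, 11}),
]
direct = prob(frozenset.union(*events), omega)   # left-hand side of (2.11)
assert inclusion_exclusion(events, omega) == direct
print(direct)
```

Each group of indices is visited once by `combinations`, matching the convention of footnote 5.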


Example (Coincidences). Suppose n students have n identical raincoats 
which they unwittingly hang on the same coat rack while attending class. 
After class, each student selects a raincoat at random, being unable to tell it 
apart from all the others. What is the probability that at least one raincoat 
ends up with its original owner? 


Solution. We number both the students and the raincoats from 1 to $n$,
with the $k$th raincoat belonging to the $k$th student ($k = 1, 2, \ldots, n$). Let
$A_k$ be the event that the $k$th student retrieves his own raincoat. Then the
event $A$ that “at least one raincoat ends up with its original owner” is just

$$A = \bigcup_{k=1}^{n} A_k.$$

Every outcome of the experiment consisting of “randomly selecting” the
raincoats can be described by a permutation $(i_1, i_2, \ldots, i_n)$, where $i_k$ is the
number of the raincoat selected by the $k$th student. Consider the event
$A_{k_1} A_{k_2} \cdots A_{k_m}$, where $m \le n$. This event occurs whenever $i_{k_1} = k_1$, $i_{k_2} = k_2$,
$\ldots$, $i_{k_m} = k_m$ and the other indices take the remaining $n - m$ values in
any order. Therefore

$$P(A_{k_1} A_{k_2} \cdots A_{k_m}) = \frac{N(A_{k_1} A_{k_2} \cdots A_{k_m})}{N} = \frac{(n - m)!}{n!},$$

where $N(A_{k_1} A_{k_2} \cdots A_{k_m}) = (n - m)!$ is just the total number of permutations
of $n - m$ things, and $N = n!$ is the total number of permutations of
$n$ things ($m$ is the number of fixed indices $k_1, k_2, \ldots, k_m$). There are precisely

$$\binom{n}{m} = \frac{n!}{m!\,(n - m)!}$$

distinct events of the type $A_{k_1} A_{k_2} \cdots A_{k_m}$ with $m$ fixed indices, this being the
number of combinations of $n$ things taken $m$ at a time (recall Theorem 1.3,
p. 7). It follows that

$$P_m = \sum_{1 \le k_1 < k_2 < \cdots < k_m \le n} P(A_{k_1} A_{k_2} \cdots A_{k_m})
= \frac{n!}{m!\,(n - m)!} \cdot \frac{(n - m)!}{n!} = \frac{1}{m!}.$$

Hence, by formula (2.11),

$$P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr) = P_1 - P_2 + P_3 - \cdots \pm P_n
= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots \pm \frac{1}{n!},$$
i.e., the desired probability $P(A)$ is a partial sum of the power series expansion
of the function $1 - e^{x}$ with $x = -1$:

$$1 - e^{x} = -x - \frac{x^2}{2!} - \frac{x^3}{3!} - \cdots - \frac{x^n}{n!} - \cdots.$$

Thus, for large $n$,

$$P(A) \approx 1 - e^{-1} \approx 0.632. \tag{2.14}$$
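The quality of the approximation (2.14) for small $n$ can be examined directly. The sketch below computes the partial sum $1 - 1/2! + 1/3! - \cdots \pm 1/n!$ exactly and, as an independent check, counts permutations with at least one fixed point; the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def p_match_series(n):
    """Partial sum 1 - 1/2! + 1/3! - ... ± 1/n! obtained from (2.11)."""
    return sum(Fraction((-1) ** (m + 1), factorial(m)) for m in range(1, n + 1))

def p_match_bruteforce(n):
    """Fraction of permutations in which some student gets his own raincoat."""
    perms = list(permutations(range(n)))
    hits = sum(any(p[k] == k for k in range(n)) for p in perms)
    return Fraction(hits, len(perms))

for n in range(3, 7):
    exact = p_match_series(n)
    assert exact == p_match_bruteforce(n)
    print(n, exact, float(exact), 1 - exp(-1))
```

The brute-force count is feasible only for small $n$, since it enumerates all $n!$ permutations.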


To generalize the addition law to the case of an infinite sequence of mutually
exclusive events $A_1, A_2, \ldots$, we repeatedly apply (2.1). Thus

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2 \cup A_3 \cup \cdots)
= P(A_1) + P(A_2) + P(A_3 \cup \cdots)
= P(A_1) + P(A_2) + P(A_3) + \cdots,$$

or equivalently,

$$P\Bigl(\bigcup_{k=1}^{\infty} A_k\Bigr) = \sum_{k=1}^{\infty} P(A_k).$$

We can combine this formula and (2.2) into a single formula

$$P\Bigl(\bigcup_{k} A_k\Bigr) = \sum_{k} P(A_k), \tag{2.2'}$$

where it will always be clear from the context whether $\bigcup$ and $\sum$ have finite
or infinite limits.⁶

The “generalized addition law” (2.2') has a number of important consequences.
We begin with two theorems expressing a kind of “continuity
property” of probability:


THEOREM 2.3. If $A_1, A_2, \ldots$ is an “increasing sequence” of events,
i.e., a sequence such that $A_1 \subset A_2 \subset \cdots$, then

$$P\Bigl(\bigcup_{k} A_k\Bigr) = \lim_{n \to \infty} P(A_n). \tag{2.15}$$


Proof. Clearly, the events

$$B_1 = A_1, \quad B_2 = A_2 - A_1, \quad \ldots, \quad B_n = A_n - \bigcup_{k=1}^{n-1} B_k, \quad \ldots \tag{2.16}$$

6 In the last analysis, formulas (2.1), (2.2) and (2.3) are axioms, although they are, of
course, strongly suggested by experience, i.e., by the interpretation of probabilities as 
limiting values of relative frequencies. In this sense, they are the “only reasonable axioms,” 
and lead to a model of random phenomena whose consequences are fully confirmed by 
experiment. 
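are mutually exclusive and have union $\bigcup_k A_k$. Moreover,

$$\bigcup_{k=1}^{n} B_k = A_n.$$

Therefore, by (2.2'),

$$P\Bigl(\bigcup_{k} A_k\Bigr) = P\Bigl(\bigcup_{k} B_k\Bigr) = \sum_{k} P(B_k)
= \lim_{n \to \infty} \sum_{k=1}^{n} P(B_k)
= \lim_{n \to \infty} P\Bigl(\bigcup_{k=1}^{n} B_k\Bigr) = \lim_{n \to \infty} P(A_n). \ ∎$$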
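Similarly, we have


THEOREM 2.3'. If $A_1, A_2, \ldots$ is a “decreasing sequence” of events,
i.e., a sequence such that $A_1 \supset A_2 \supset \cdots$, then

$$P\Bigl(\bigcap_{k} A_k\Bigr) = \lim_{n \to \infty} P(A_n).$$

Proof. Going over to complementary events, we have $\bar{A}_1 \subset \bar{A}_2 \subset \cdots$,
and hence, by (2.15),

$$P\Bigl(\bigcap_{k} A_k\Bigr) = 1 - P\Bigl(\bigcup_{k} \bar{A}_k\Bigr) = 1 - \lim_{n \to \infty} P(\bar{A}_n)
= \lim_{n \to \infty} [1 - P(\bar{A}_n)] = \lim_{n \to \infty} P(A_n). \ ∎$$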


In the case of arbitrary events, we must replace $=$ by $\le$ in (2.2'):


THEOREM 2.4. The inequality

$$P\Bigl(\bigcup_{k} A_k\Bigr) \le \sum_{k} P(A_k)$$

holds for arbitrary events $A_1, A_2, \ldots$


Proof. As in the proof of Theorem 2.3, $\bigcup_k A_k$ is the union of the
mutually exclusive events (2.16), where obviously $B_k \subset A_k$ and hence
$P(B_k) \le P(A_k)$, by (2.7). Therefore

$$P\Bigl(\bigcup_{k} A_k\Bigr) = P\Bigl(\bigcup_{k} B_k\Bigr) = \sum_{k} P(B_k) \le \sum_{k} P(A_k). \ ∎$$


Finally, we prove a proposition that will be needed in Chapter 7: 


THEOREM 2.5 (First Borel-Cantelli lemma).⁷ Given a sequence of events
$A_1, A_2, \ldots$, with probabilities $p_k = P(A_k)$, $k = 1, 2, \ldots$, suppose

$$\sum_{k=1}^{\infty} p_k < \infty, \tag{2.17}$$

i.e., suppose the series on the left converges. Then, with probability 1,
only finitely many of the events $A_1, A_2, \ldots$ occur.


7 For the “second Borel-Cantelli lemma,” see Theorem 3.1, p. 33.


Proof. Let $B$ be the event that infinitely many of the events $A_1$,
$A_2, \ldots$ occur, and let

$$B_n = \bigcup_{k \ge n} A_k,$$

so that $B_n$ is the event that at least one of the events $A_n, A_{n+1}, \ldots$
occurs. Clearly $B$ occurs if and only if $B_n$ occurs for every $n = 1,
2, \ldots$ Therefore

$$B = \bigcap_{n} B_n = \bigcap_{n} \Bigl(\bigcup_{k \ge n} A_k\Bigr).$$

Moreover, $B_1 \supset B_2 \supset \cdots$, and hence, by Theorem 2.3',

$$P(B) = \lim_{n \to \infty} P(B_n).$$

But, by Theorem 2.4,

$$P(B_n) \le \sum_{k \ge n} P(A_k) = \sum_{k \ge n} p_k \to 0 \quad \text{as } n \to \infty,$$

because of (2.17). Therefore

$$P(B) = \lim_{n \to \infty} P(B_n) = 0,$$

i.e., the probability of infinitely many of the events $A_1, A_2, \ldots$ occurring
is 0. Equivalently, the probability of only finitely many of the events
$A_1, A_2, \ldots$ occurring is 1. ∎
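The key step of the proof is that the tail sums $\sum_{k \ge n} p_k$ tend to zero. The sketch below illustrates this for the assumed example $p_k = 1/k^2$, and also simulates one realization of the events; the simulation additionally assumes the events are independent, which the lemma itself does not require. All names and parameter values are illustrative.

```python
import random

def tail_sum(n, terms=200_000):
    """Truncated tail sum of p_k = 1/k**2 over n <= k < n + terms.

    This approximates the bound on P(B_n) from Theorem 2.4."""
    return sum(1.0 / k**2 for k in range(n, n + terms))

def last_occurrence(max_k=100_000, seed=0):
    """Index of the last event A_k (k <= max_k) occurring in one simulated run.

    Assumes independent events with P(A_k) = 1/k**2."""
    rng = random.Random(seed)
    last = 0
    for k in range(1, max_k + 1):
        if rng.random() < 1.0 / k**2:
            last = k
    return last

for n in (10, 100, 1000):
    print(n, tail_sum(n))
print(last_occurrence())
```

In a typical run the last occurrence falls at a small index, in keeping with the statement that only finitely many of the events occur.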


PROBLEMS 
1. Interpret the following relations involving events $A$, $B$ and $C$:
a) $AB = A$; b) $ABC = A$; c) $A \cup B \cup C = A$.


2. When do the following relations involving the events $A$ and $B$ hold:
a) $A \cup B = A$; b) $AB = A$; c) $A \cup B = AB$?


3. Simplify the following expressions involving events $A$, $B$ and $C$:
a) $(A \cup B)(B \cup C)$; b) $(A \cup B)(A \cup \bar{B})$; c) $(A \cup B)(\bar{A} \cup B)(A \cup \bar{B})$.
Ans. a) $AC \cup B$; b) $A$; c) $AB$.


4. Given two events $A$ and $B$, find the event $X$ such that

$$\overline{(X \cup A)} \cup \overline{(X \cup \bar{A})} = B.$$

Ans. $X = \bar{B}$.
5. Let $A$ be the event that at least one of three inspected items is defective, and $B$
the event that all three items are of acceptable quality. What are the events
$A \cup B$ and $AB$?


6. A whole number from 1 to 1000 is chosen at random. Let $A$ be the event that
the number is divisible by 5, and $B$ the event that the number ends in a zero.
What is the event $AB$?


7. A target is made up of 10 circular disks bounded by 10 concentric circles of
radii $r_1, r_2, \ldots, r_{10}$, where $r_1 < r_2 < \cdots < r_{10}$. Let $A_k$ be the event consisting of
the disk of radius $r_k$ being hit ($k = 1, 2, \ldots, 10$). What are the events

$$B = \bigcup_{k=1}^{6} A_k, \qquad C = \bigcap_{k=5}^{10} A_k?$$

Ans. $B = A_6$, $C = A_5$.


8. Given any event $A$, prove that

$$P(A) = 1 - P(\bar{A}), \qquad P(\bar{A}) = 1 - P(A).$$
9. A marksman fires at a target made up of a central circular disk and two
concentric rings. The probabilities of hitting the disk and the rings are 0.35,
0.30 and 0.25, respectively. What is the probability of missing the target?


10. Five items are chosen at random from a batch of 100 items and then inspected.
The whole batch is rejected if any of the items is found to be defective.
What is the probability of the batch being rejected if it contains 5 defective items?

Ans. $1 - \dfrac{95 \cdot 94 \cdot 93 \cdot 92 \cdot 91}{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96} \approx 0.23.$


11. A secretary forgets the last digit of a telephone number, and dials the last
digit at random. What is the probability of calling no more than three wrong
numbers? How is this probability changed if she recalls that the last digit is
even?


12. Given any $n$ events $A_1, A_2, \ldots, A_n$, prove that

$$P\Bigl(\bigcap_{k=1}^{n} \bar{A}_k\Bigr) = 1 - P\Bigl(\bigcup_{k=1}^{n} A_k\Bigr).$$


13. A batch of 100 manufactured items contains 5 defective items. Fifty items
are chosen at random and then inspected. Suppose the whole batch is accepted if
no more than one of the 50 inspected items is defective. What is the probability
of accepting the whole batch?

Ans. $\dfrac{47 \cdot 37}{99 \cdot 97} \approx 0.18.$

14. Write an expression for the probability p(r) that among r randomly selected 
people, at least two have a common birthday. 


Comment. Rather surprisingly, it turns out that $p(r) > \frac{1}{2}$ if $r = 23$.⁸


8 See W. Feller, op. cit., p. 33.




15. Test the approximation (2.14) for n = 3, 4, 5 and 6. 


16. Use Theorem 2.2 and Stirling’s formula to find the probability that some 
player is dealt a complete suit in a game of bridge. 


ws lS 2 72 3° Bon ve 

eae ees oh re x 4 
“Gh Caen CacReR "a ** 
17. Given any $n$ events $A_1, A_2, \ldots, A_n$, prove that the probability of exactly
$m$ ($m \le n$) of the events occurring is

$$P_m - \binom{m+1}{m} P_{m+1} + \binom{m+2}{m} P_{m+2} - \cdots \pm \binom{n}{m} P_n,$$

where $P_m, P_{m+1}, \ldots$ are the same as in Theorem 2.2.


18. Let n = 10 in the example on p. 19. What is the probability that exactly 
5 raincoats end up with their original owners? 


19. A whole number from 1 to 1000 is chosen at random. What is the proba- 
bility of its being a power (higher than the first) of another whole number? 
Hint. 31? < 1000 < 32?. 
Ans. 5. 
5. Conditional Probability


In observing the outcomes of a random experiment, one is often interested
in how the outcome of one event $A$ is influenced by that of another
event $B$. For example, in one extreme case the relation between $A$ and $B$ may
be such that $A$ always occurs if $B$ does, while in the other extreme case $A$ never
occurs if $B$ does. To characterize the relation between $A$ and $B$, we introduce
the conditional probability of $A$ on the hypothesis $B$, i.e., the “probability of
$A$ occurring under the condition that $B$ is known to have occurred.” This
quantity is defined by

$$P(A \mid B) = \frac{P(AB)}{P(B)}, \tag{3.1}$$

where $AB$ is the intersection of the events $A$ and $B$, and it is assumed that
$P(B) > 0$.

To clarify the meaning of (3.1), consider an experiment with a finite
number of equiprobable outcomes (elementary events). Let $N$ be the total
number of outcomes, $N(B)$ the number of outcomes leading to the occurrence
of the event $B$, and $N(AB)$ the number of outcomes leading to the occurrence
of both $A$ and $B$. Then, as on p. 1, the probabilities of $B$ and $AB$ are just

$$P(B) = \frac{N(B)}{N}, \qquad P(AB) = \frac{N(AB)}{N}, \tag{3.2}$$

and hence (3.1) implies

$$P(A \mid B) = \frac{N(AB)}{N(B)}. \tag{3.3}$$

But (3.3) is of the same form as (3.2), if we restrict the set of possible outcomes
to those in which $B$ is known to have occurred. In fact, the denominator
in (3.3) is the total number of such outcomes, while the numerator is the
total number of such outcomes leading to the occurrence of $A$.
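Formula (3.3) amounts to counting outcomes within the restricted set where $B$ has occurred. The following sketch assumes, purely for illustration, an experiment of throwing two fair dice, with $A$ = “the total is 8” and $B$ = “the first die is even”; none of these choices comes from the text.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # N = 36 equiprobable outcomes

def cond_prob(a, b):
    """P(A | B) = N(AB) / N(B), as in formula (3.3)."""
    n_b = sum(1 for w in outcomes if b(w))
    n_ab = sum(1 for w in outcomes if a(w) and b(w))
    return Fraction(n_ab, n_b)

def total_is_eight(w):
    return w[0] + w[1] == 8

def first_is_even(w):
    return w[0] % 2 == 0

print(cond_prob(total_is_eight, first_is_even))
```

Here $N(B) = 18$ and $N(AB) = 3$, so the conditional probability is $1/6$, compared with the unconditional value $P(A) = 5/36$.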

It is easy to see that conditional probabilities have properties analogous 
to those of ordinary probabilities. For example, 


a) $0 \le P(A \mid B) \le 1$;

b) If $A$ and $B$ are incompatible, so that $AB = \emptyset$, then $P(A \mid B) = 0$;

c) If $B$ implies $A$, so that $B \subset A$, then $P(A \mid B) = 1$;

d) If $A_1, A_2, \ldots$ are mutually exclusive events with union $A = \bigcup_k A_k$,
then

$$P(A \mid B) = \sum_{k} P(A_k \mid B) \tag{3.4}$$

(the addition law for conditional probabilities).


Property a) is an immediate consequence of (3.1) and the formula $0 \le
P(AB) \le P(B)$, implied by $\emptyset \subset AB \subset B$. To prove b), we note that
$AB = \emptyset$ implies $P(AB) = 0$ and hence $P(A \mid B) = 0$, by (3.1). Similarly,
c) follows from the observation that if $B \subset A$, then $AB = B$, $P(AB) = P(B)$,
and hence $P(A \mid B) = 1$, by (3.1). Finally, if $A = \bigcup_k A_k$, where $A_1, A_2, \ldots$
are mutually exclusive events, then

$$AB = \bigcup_{k} A_k B,$$

and hence

$$P(AB) = \sum_{k} P(A_k B), \tag{3.5}$$

by formula (2.2'), p. 20, the addition law for ordinary probabilities. Dividing
(3.5) by $P(B)$, we get (3.4), because of (3.1) and

$$P(A_k \mid B) = \frac{P(A_k B)}{P(B)}.$$

In calculating the probability of an event $A$, it is often convenient to use
conditional probabilities as an intermediate step. Suppose $B_1, B_2, \ldots$ is a
“full set”¹ of mutually exclusive events, in the sense that one (and only one)
of the events $B_1, B_2, \ldots$ always occurs. Then we can find $P(A)$ by using the
“total probability formula”

$$P(A) = \sum_{k} P(A \mid B_k) P(B_k). \tag{3.6}$$


1 Synonymously, an “exhaustive set.”

To prove (3.6), we need only note that

$$\bigcup_{k} B_k = \Omega,$$

where $\Omega$ is the whole sample space, since one of the events $B_1, B_2, \ldots$ must
occur. But then

$$A = \bigcup_{k} A B_k,$$

and hence

$$P(A) = P\Bigl(\bigcup_{k} A B_k\Bigr) = \sum_{k} P(A B_k) = \sum_{k} \frac{P(A B_k)}{P(B_k)} P(B_k),$$

which is equivalent to (3.6).


Example 1. A hiker leaves the point $O$ shown in Figure 3, choosing one
of the roads $OB_1$, $OB_2$, $OB_3$, $OB_4$ at random. At each subsequent crossroads
he again chooses a road at random. What is the probability of the hiker
arriving at the point $A$?


Figure 3


Solution. Let the event that the hiker passes through the point $B_k$, $k =
1, \ldots, 4$, be denoted by the same symbol $B_k$ as the point itself. Then $B_1$, $B_2$,
$B_3$, $B_4$ form a “full set” of mutually exclusive events, since the hiker must
pass through one of these points. Moreover, the events $B_1$, $B_2$, $B_3$, $B_4$ are
equiprobable, since, by hypothesis, the hiker initially makes a completely
random choice of one of the roads $OB_1$, $OB_2$, $OB_3$, $OB_4$. Therefore

$$P(B_k) = \frac{1}{4}, \qquad k = 1, \ldots, 4.$$

Once having arrived at $B_1$, the hiker can proceed to $A$ only by making the
proper choice of one of three equiprobable roads. Hence the conditional
probability of arriving at $A$ starting from $B_1$ is just $\frac{1}{3}$. Let the event that the
hiker arrives at $A$ be denoted by the same symbol $A$ as the point itself. Then

$$P(A \mid B_1) = \frac{1}{3},$$

and similarly

$$P(A \mid B_2) = \frac{1}{2}, \qquad P(A \mid B_3) = 1, \qquad P(A \mid B_4) = \frac{2}{5}$$

(consult the figure). It follows from (3.6) that the probability of arriving at
$A$ is

$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + P(A \mid B_3)P(B_3) + P(A \mid B_4)P(B_4)
= \frac{1}{4}\Bigl(\frac{1}{3} + \frac{1}{2} + 1 + \frac{2}{5}\Bigr) = \frac{67}{120}.$$
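The arithmetic of this example can be reproduced with exact fractions. The sketch below takes the conditional probabilities $\frac{1}{3}, \frac{1}{2}, 1, \frac{2}{5}$ and the prior probabilities $\frac{1}{4}$ from the solution above; the variable names are illustrative.

```python
from fractions import Fraction

# P(B_k): the hiker picks one of the four roads at random
p_b = [Fraction(1, 4)] * 4
# P(A | B_k), as read off the figure in the example
p_a_given_b = [Fraction(1, 3), Fraction(1, 2), Fraction(1), Fraction(2, 5)]

# Total probability formula (3.6)
p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))
print(p_a)
```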

Example 2. (The optimal choice problem). Consider a set of m objects, 
all of different quality, such that it is always possible to tell which of a given 
pair of objects is better. Suppose the objects are presented one at a time and 
at random to an observer, who at each stage either selects the object, thereby 
designating it as “the best’’ and examining no more objects, or rejects the 
object once and for all and examines another one. (Of course, the observer 
may very well make the mistake of rejecting the best object in the vain hope 
of finding a better one!) For example, the observer may be a fussy young lady 
and the objects a succession of m suitors. At each stage, she can either accept 
the suitor’s proposal of marriage, thereby terminating the process of selecting 
a husband, or she may reject him (thereby losing him forever) and wait for a 
better prospect to come along. It will further be assumed that the observer 
adopts the following natural rule for selecting the best object: “Never select
an object inferior to those previously rejected.” Then the observer can select
the first object and stop looking for a better one, or he can reject the first 
object and examine further objects one at a time until he finds one better 
than those previously examined. He can then select this object, thereby 
terminating the inspection process, or he can examine further objects in the 
hope of eventually finding a still better one, and so on. Of course, it is 
entirely possible that he will reject the very best object somewhere along the 
line, and hence never be able to make a selection at all. On the other hand, 
if the number of objects is large, almost anyone would reject the first object 
in the hope of eventually finding a better one. 

Now suppose the observer, following the above “decision rule,” selects
the $i$th inspected object once and for all, giving up further inspection. (The
$i$th object must then be better than the $i - 1$ previously inspected objects.)
What is the probability that this $i$th object is actually the best of all $m$ objects,
both inspected and uninspected?
Solution. Let $B$ be the event that the last of the $i$ inspected objects is the
best of those inspected, and let $A$ be the event that the $i$th object is the best
of all $m$ objects, both inspected and uninspected. Then we want the conditional
probability $P(A \mid B)$ of the event $A$ given that $B$ has already occurred.
According to (3.1), to calculate $P(A \mid B)$ we need both $P(B)$ and $P(AB)$.
Obviously $A \subset B$ and hence $AB = A$, so that $P(AB) = P(A)$. By hypothesis,
all possible arrangements of the objects in order of presentation are equiprobable
(the objects are presented “at random”). Hence $P(B)$ is the probability
that in a random permutation of $i$ distinguishable objects (the objects
differ in quality) a given object (the best of all $i$ objects) occupies the $i$th
place. Since there are $i!$ permutations of all $i$ objects and $(i - 1)!$ permutations
subject to the condition that a given object occupy the $i$th place, this
probability is just

$$P(B) = \frac{(i - 1)!}{i!} = \frac{1}{i}.$$

Similarly, $P(A)$ is the probability that in a random permutation of $m$ distinguishable
objects, a given object (the best of all $m$ objects) occupies the
$i$th place, and hence

$$P(A) = \frac{(m - 1)!}{m!} = \frac{1}{m}.$$

Therefore the desired conditional probability $P(A \mid B)$ is just

$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{P(A)}{P(B)} = \frac{i}{m}.$$
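The result $P(A \mid B) = i/m$ can be confirmed by enumerating all orders of presentation for a small number of objects. In this sketch the qualities are represented by the integers $1, \ldots, m$ (larger meaning better), which is an assumption made only for the illustration, as is the function name.

```python
from fractions import Fraction
from itertools import permutations

def p_best_given_record(m, i):
    """P(A | B) by counting: position i (1-based) holds the best of the first i
    objects (event B); A is the event that it is the best of all m objects."""
    n_b = n_ab = 0
    for order in permutations(range(1, m + 1)):
        if order[i - 1] == max(order[:i]):     # event B
            n_b += 1
            if order[i - 1] == m:              # event A (which implies B)
                n_ab += 1
    return Fraction(n_ab, n_b)

m = 6
for i in range(1, m + 1):
    assert p_best_given_record(m, i) == Fraction(i, m)
print([str(p_best_given_record(m, i)) for i in range(1, m + 1)])
```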


Example 3 (The gambler’s ruin). Consider the game of “heads or tails,” 
in which a coin is tossed and a player wins 1 dollar, say, if he successfully 
calls the side of the coin which lands upward, but otherwise loses 1 dollar. 
Suppose the player's initial capital is x dollars, and he intends to play until 
he wins m dollars but no longer. In other words, suppose the game continues 
until the player either wins the amount of m dollars, stipulated in advance, 
or else loses all his capital and is “ruined.” What is the probability that the 
player will be ruined? 


Solution. The probability of ruin clearly depends on both the initial
capital $x$ and the final amount $m$. Let $p(x)$ be the probability of the player's
being ruined if he starts with a capital of $x$ dollars. Then the probability of
ruin given that the player wins the first call is just $p(x + 1)$, since the player's
capital becomes $x + 1$ if he wins the first call. Similarly, the probability of
ruin given that the player loses the first call is $p(x - 1)$, since the player's
capital becomes $x - 1$ if he loses the first call. In other words, if $B_1$ is the
event that the player wins the first call and $B_2$ the event that he loses the first
call, while $A$ is the event of ruin, then

$$P(A \mid B_1) = p(x + 1), \qquad P(A \mid B_2) = p(x - 1).$$

The mutually exclusive events $B_1$ and $B_2$ form a “full set,” since the player
either wins or loses the first call. Moreover, we have

$$P(B_1) = \frac{1}{2}, \qquad P(B_2) = \frac{1}{2},$$

assuming fair tosses of an unbiased coin (cf. Problem 1, p. 65). Hence, by
(3.6),

$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2),$$

i.e.,

$$p(x) = \frac{1}{2}\,[p(x + 1) + p(x - 1)], \qquad 1 \le x \le m - 1, \tag{3.7}$$

where obviously

$$p(0) = 1, \qquad p(m) = 0. \tag{3.8}$$

The solution of (3.7) is the linear function

$$p(x) = C_1 + C_2 x, \tag{3.9}$$

where the coefficients $C_1$ and $C_2$ are determined by the boundary conditions
(3.8), which imply

$$C_1 = 1, \qquad C_1 + C_2 m = 0. \tag{3.10}$$

Combining (3.9) and (3.10), we finally find that the probability of ruin given
an initial capital of $x$ dollars is just

$$p(x) = 1 - \frac{x}{m}, \qquad 0 \le x \le m.$$
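The closed form $p(x) = 1 - x/m$ can be checked against the difference equation (3.7) and the boundary conditions (3.8), and compared with a direct simulation of the game. The sketch below is illustrative: the values $m = 10$, $x = 3$, the number of simulated games and the function names are all assumptions.

```python
from fractions import Fraction
import random

def ruin_probability(x, m):
    """Closed-form solution p(x) = 1 - x/m of (3.7) with boundary conditions (3.8)."""
    return 1 - Fraction(x, m)

def simulate_ruin(x, m, games=20_000, seed=0):
    """Estimate the ruin probability by playing fair 1-dollar games
    until the capital reaches 0 (ruin) or m (target)."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(games):
        capital = x
        while 0 < capital < m:
            capital += 1 if rng.random() < 0.5 else -1
        ruined += capital == 0
    return ruined / games

m = 10
for x in range(1, m):
    # Difference equation (3.7)
    assert ruin_probability(x, m) == (ruin_probability(x + 1, m) + ruin_probability(x - 1, m)) / 2
# Boundary conditions (3.8)
assert ruin_probability(0, m) == 1 and ruin_probability(m, m) == 0
print(simulate_ruin(3, m), float(ruin_probability(3, m)))
```

The simulated frequency should lie close to the exact value $0.7$, with the agreement improving as the number of games grows.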


6. Statistical Independence 


In saying that two experiments are “statistically independent” (or briefly,
“independent”), we mean, roughly speaking, that the outcome of one experiment
has no influence on the outcome of the other. Let $A_1$ be an event
associated only with the first experiment, and $A_2$ an event associated only with
the second experiment. Then the occurrence of $A_1$ has no influence on the
probability of occurrence of $A_2$, and conversely. In this sense, we say that the
events $A_1$ and $A_2$ are “(statistically) independent.”

To give mathematical expression to the notion of independence, we
calculate the probability that two independent events $A_1$ and $A_2$ both occur.
To this end, we again resort to the empirical fact that the relative frequency
of an event in a large series of “independent trials under identical conditions”²
virtually coincides with its probability (recall Sec. 1). Imagine a long series
of such trials, where each trial involves carrying out both experiments. If $n$
is the total number of trials and $n(A_1 A_2)$ the number of trials leading to
occurrence of both $A_1$ and $A_2$, then

$$P(A_1 A_2) \approx \frac{n(A_1 A_2)}{n}. \tag{3.11}$$

Moreover, if $n(A_2)$ is the number of trials leading to occurrence of $A_2$, then

$$P(A_2) \approx \frac{n(A_2)}{n}. \tag{3.12}$$

Suppose we confine ourselves to examining the results of the $n(A_2)$ trials
leading to occurrence of $A_2$, and look for occurrence of $A_1$. Then clearly $A_1$
will occur in precisely $n(A_1 A_2)$ of these trials. Moreover, if $n$ is very large, then
so is $n(A_2)$, and hence

$$P(A_1) \approx \frac{n(A_1 A_2)}{n(A_2)}, \tag{3.13}$$

since $A_2$ is associated only with the second experiment, which has nothing
whatsoever to do with the first experiment or the event $A_1$ associated with it.
Combining (3.11)-(3.13), we find that

$$P(A_1 A_2) \approx \frac{n(A_1 A_2)}{n} = \frac{n(A_1 A_2)}{n(A_2)} \cdot \frac{n(A_2)}{n} \approx P(A_1)P(A_2),$$

or, after going over to exact equations (in the limit as $n \to \infty$),

$$P(A_1 A_2) = P(A_1)P(A_2). \tag{3.14}$$

Two events $A_1$ and $A_2$ are said to be (statistically) independent if they satisfy
(3.14) and (statistically) dependent otherwise.³

The definition (3.14) is in keeping with the notion of conditional probability
introduced in Sec. 5. In fact, if two events $A_1$ and $A_2$ are independent,
then, loosely speaking, the occurrence of $A_2$ should have no influence on the
probability of occurrence of $A_1$, and hence the conditional probability


2 Thus there remains the problem of just what is meant by “independent trials under
identical conditions” (a phrase already encountered on pp. 2 and 16), although the intuitive
meaning of the phrase is perfectly clear, e.g., in a series of coin tosses. For a rigorous
discussion of this whole issue, see W. Feller, op. cit., p. 128.

3 In the last analysis, (3.14) is a definition, although one strongly suggested by experience,
i.e., by the intuitive meaning of independence and the interpretation of probabilities as
limiting values of relative frequencies (recall footnote 6, p. 20).
$P(A_1 \mid A_2)$ of $A_1$ occurring given that $A_2$ has already occurred should be the
same as the unconditional probability of $A_1$, i.e.,

$$P(A_1 \mid A_2) = P(A_1)$$

(and similarly with $A_1$ and $A_2$ changing places). But clearly

$$P(A_1 \mid A_2) = \frac{P(A_1 A_2)}{P(A_2)} = P(A_1)$$

if and only if (3.14) holds.


Example 1. Let $A_1$ be the event that a card picked at random from a
full deck is a spade, and $A_2$ the event that it is a queen. Are $A_1$ and $A_2$
independent events?


Solution. The question is not easily answered on the basis of physical
intuition alone. However, noting that a full deck (52 cards) contains 13
spades and 4 queens, but only one queen of spades, we see at once that

$$P(A_1) = \frac{13}{52} = \frac{1}{4}, \qquad P(A_2) = \frac{4}{52} = \frac{1}{13}, \qquad P(A_1 A_2) = \frac{1}{52},$$

and hence $P(A_1 A_2) = P(A_1)P(A_2)$. Therefore the events $A_1$ and $A_2$ are
independent.
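The counting in this example can be repeated by enumerating a 52-card deck and testing the product rule (3.14). In the sketch below the encoding of ranks and suits as strings and the helper names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))                # 52 equiprobable cards

def prob(pred):
    """Probability that a randomly picked card satisfies pred."""
    return Fraction(sum(1 for card in deck if pred(card)), len(deck))

def is_spade(card):
    return card[1] == "spades"

def is_queen(card):
    return card[0] == "Q"

p_both = prob(lambda card: is_spade(card) and is_queen(card))
assert p_both == prob(is_spade) * prob(is_queen)  # product rule (3.14) holds
print(prob(is_spade), prob(is_queen), p_both)
```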


Example 2. In throwing a pair of dice, let $A_1$ be the event that “the
first die turns up odd,” $A_2$ the event that “the second die turns up odd,” and
$A_3$ the event that “the total number of spots is odd.” Clearly, the number of
spots on one die has nothing to do with the number of spots on the other die,
and hence the events $A_1$ and $A_2$ are independent, with probabilities

$$P(A_1) = \frac{1}{2}, \qquad P(A_2) = \frac{1}{2}.$$

Moreover, it is clear that

$$P(A_3) = \frac{1}{2}.$$

Given that $A_1$ has occurred, $A_3$ can occur only if the second die turns up even.
Hence

$$P(A_3 \mid A_1) = \frac{1}{2},$$

and similarly

$$P(A_3 \mid A_2) = \frac{1}{2}.$$

It follows that

$$P(A_3 \mid A_1) = P(A_3), \qquad P(A_3 \mid A_2) = P(A_3).$$

Therefore the events $A_1$ and $A_3$ are independent, and so are the events $A_2$
and $A_3$.


Generalizing (3.14), we have the following 


DEFINITION. The events $A_1, A_2, \ldots, A_n$ are said to be (mutually)
independent if

$$P(A_i A_j) = P(A_i)P(A_j),$$
$$P(A_i A_j A_k) = P(A_i)P(A_j)P(A_k),$$
$$\cdots$$
$$P(A_1 A_2 \cdots A_n) = P(A_1)P(A_2) \cdots P(A_n)$$

for all combinations of indices such that $1 \le i < j < \cdots < k \le n$.


Example 3. The events $A_1$, $A_2$ and $A_3$ in Example 2 are not independent,
even though they are “pairwise independent” in the sense that

$$P(A_i A_j) = P(A_i)P(A_j)$$

for all $1 \le i < j \le 3$. In fact, $A_3$ obviously cannot occur if $A_1$ and $A_2$ both
occur, and hence

$$P(A_1 A_2 A_3) = 0.$$

But

$$P(A_1)P(A_2)P(A_3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8},$$

so that

$$P(A_1 A_2 A_3) \ne P(A_1)P(A_2)P(A_3).$$
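The distinction between pairwise and mutual independence in this example can be verified by enumerating the 36 outcomes of two dice. The helper names below are illustrative assumptions; the events themselves are those of Example 2.

```python
from fractions import Fraction
from itertools import combinations, product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equiprobable outcomes

def prob(pred):
    return Fraction(sum(1 for w in outcomes if pred(w)), len(outcomes))

def a1(w):
    return w[0] % 2 == 1              # first die odd

def a2(w):
    return w[1] % 2 == 1              # second die odd

def a3(w):
    return (w[0] + w[1]) % 2 == 1     # total number of spots odd

# Pairwise independence: the product rule holds for every pair
for e, f in combinations([a1, a2, a3], 2):
    assert prob(lambda w: e(w) and f(w)) == prob(e) * prob(f)

# Mutual independence fails: the triple intersection is impossible
triple = prob(lambda w: a1(w) and a2(w) and a3(w))
print(triple, prob(a1) * prob(a2) * prob(a3))
```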


Given an infinite sequence of events $A_1, A_2, \ldots$, suppose the events
$A_1, \ldots, A_n$ are independent for every $n$. Then $A_1, A_2, \ldots$ is said to be a
sequence of independent events.


THEOREM 3.1 (Second Borel-Cantelli lemma). Given a sequence of
independent events $A_1, A_2, \ldots$, with probabilities $p_k = P(A_k)$, $k = 1,
2, \ldots$, suppose

$$\sum_{k=1}^{\infty} p_k = \infty, \tag{3.15}$$

i.e., suppose the series on the left diverges. Then, with probability 1,
infinitely many of the events $A_1, A_2, \ldots$ occur.
Proof. As in the proof of the first Borel-Cantelli lemma (Theorem
2.5, p. 21), let

$$B_n = \bigcup_{k \ge n} A_k, \qquad B = \bigcap_{n} B_n = \bigcap_{n} \Bigl(\bigcup_{k \ge n} A_k\Bigr),$$

so that $B$ occurs if and only if infinitely many of the events $A_1, A_2, \ldots$
occur. Taking complements, we have

$$\bar{B}_n = \bigcap_{k \ge n} \bar{A}_k, \qquad \bar{B} = \bigcup_{n} \bar{B}_n.$$

Clearly,

$$\bar{B}_n \subset \bigcap_{k=n}^{n+m} \bar{A}_k$$

for every $m = 0, 1, 2, \ldots$ Therefore

$$P(\bar{B}_n) \le P\Bigl(\bigcap_{k=n}^{n+m} \bar{A}_k\Bigr) = P(\bar{A}_n) \cdots P(\bar{A}_{n+m})
= (1 - p_n) \cdots (1 - p_{n+m}) \le \exp\Bigl(-\sum_{k=n}^{n+m} p_k\Bigr), \tag{3.16}$$

where we use the inequality $1 - x \le e^{-x}$, $x \ge 0$ and the fact that if $A_1$,
$A_2, \ldots$ is a sequence of independent events, then so is the sequence
of complementary events $\bar{A}_1, \bar{A}_2, \ldots$⁴ But

$$\sum_{k=n}^{n+m} p_k \to \infty \quad \text{as } m \to \infty,$$

because of (3.15). Therefore, passing to the limit $m \to \infty$ in (3.16),
we find that $P(\bar{B}_n) = 0$ for every $n = 1, 2, \ldots$ It follows that

$$P(\bar{B}) \le \sum_{n} P(\bar{B}_n) = 0,$$

and hence

$$P(B) = 1 - P(\bar{B}) = 1,$$

i.e., the probability of infinitely many of the events $A_1, A_2, \ldots$ occurring
is 1. ∎
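The inequality (3.16) can be examined numerically for a concrete divergent series. The sketch below assumes, as an example only, independent events with $p_k = 1/k$ and starts at $n = 2$ (since $p_1 = 1$ would make the product trivially zero); the function name is illustrative.

```python
from math import exp, prod

def bound_check(n, m):
    """Compare the product (1 - p_n)...(1 - p_{n+m}) with exp(-sum of p_k), as in (3.16).

    Assumes the example p_k = 1/k, whose series diverges."""
    p = [1.0 / k for k in range(n, n + m + 1)]
    product_term = prod(1.0 - pk for pk in p)     # P(none of A_n, ..., A_{n+m} occurs)
    exp_bound = exp(-sum(p))
    return product_term, exp_bound

for m in (8, 98, 998):
    product_term, exp_bound = bound_check(2, m)
    assert product_term <= exp_bound
    print(m, product_term, exp_bound)
```

For this choice the product telescopes to $1/(m + 2)$, so both quantities visibly tend to zero as $m$ grows, which is the step used to conclude $P(\bar{B}_n) = 0$.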


PROBLEMS 


1. Given any events $A$ and $B$, prove that the events $A$, $\bar{A}B$ and $\overline{A \cup B}$ form a
full set of mutually exclusive events.


4 It is intuitively clear that if the events $A_1, \ldots, A_n$ are independent, then so are their
complements. Concerning the rigorous proof of this fact, see Problem 7 and W. Feller,
op. cit., pp. 126, 128.
2. In a game of chess, let $A$ be the event that White wins and $B$ the event that
Black wins. What is the event $C$ such that $A$, $B$ and $C$ form a full set of mutually
exclusive events?


3. Prove that if $P(A \mid B) > P(A)$, then $P(B \mid A) > P(B)$.


4. Prove that if $P(A) = P(B) = \frac{2}{3}$, then $P(A \mid B) \ge \frac{1}{2}$.


5. Given any three events $A$, $B$ and $C$, prove that

$$P(ABC) = P(A)P(B \mid A)P(C \mid AB).$$

Generalize this formula to the case of any $n$ events.
6. Verify that 
; P(A) = P(A| B) + P(A| B) 
‘ a) A=; b) B=; c)B=9; d) B=4; e) B= A. 
7. Prove that if the events $A$ and $B$ are independent, then so are their complements.

Hint. Clearly $P(B \mid A) + P(\bar{B} \mid A) = 1$ for arbitrary $A$ and $B$. Moreover
$P(B \mid A) = P(B)$, by hypothesis. Therefore $P(\bar{B} \mid A) = 1 - P(B) = P(\bar{B})$, so
that $A$ and $\bar{B}$ are independent.

8. Two events $A$ and $B$ with positive probabilities are incompatible. Are
they dependent?


9. Consider $n$ urns, each containing $w$ white balls and $b$ black balls. A ball is
drawn at random from the first urn and put into the second urn, then a ball is
drawn at random from the second urn and put into the third urn, and so on,
until finally a ball is drawn from the last urn and examined. What is the probability
of this ball being white?

Ans. $\dfrac{w}{w + b}.$


10. In Example 1, p. 27, find the probability of the hiker arriving at each of 
the 6 destinations other than A. Verify that the sum of the probabilities of 
arriving at all possible destinations is 1. 


Ans. 


11. Prove that the probability of ruin in Example 3, p. 29 does not change if 
the stakes are changed. 


12. Prove that the events $A$ and $B$ are independent if $P(B \mid A) = P(B \mid \bar{A})$.


13. One urn contains $w_1$ white balls and $b_1$ black balls, while another urn
contains $w_2$ white balls and $b_2$ black balls. A ball is drawn at random from each
urn, and then one of the two balls so obtained is chosen at random. What is the
probability of this ball being white?

Ans. $\dfrac{1}{2}\Bigl(\dfrac{w_1}{w_1 + b_1} + \dfrac{w_2}{w_2 + b_2}\Bigr).$
14. Nine out of 10 urns contain 2 white balls and 2 black balls each, while the
other urn contains 5 white balls and 1 black ball. A ball drawn from a randomly
chosen urn turns out to be white. What is the probability that the ball came from
the urn containing 5 white balls?

Hint. If $B_1, \ldots, B_n$ is a full set of mutually exclusive events, then

$$P(B_i \mid A) = \frac{P(B_i)P(A \mid B_i)}{P(A)} = \frac{P(B_i)P(A \mid B_i)}{\sum_{k=1}^{n} P(B_k)P(A \mid B_k)},$$

a formula known as Bayes' rule. The events $B_1, \ldots, B_n$ are often regarded as
“hypotheses” accounting for the occurrence of $A$.

Ans. $\frac{5}{32}$.
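Bayes' rule as stated in the hint is straightforward to evaluate with exact fractions. The sketch below groups the nine identical urns into a single hypothesis with prior probability $\frac{9}{10}$; this grouping, the variable names and the function `bayes` are assumptions made for the illustration.

```python
from fractions import Fraction

# Hypotheses: B_1 = "a 2-white/2-black urn was chosen", B_2 = "the 5-white/1-black urn was chosen"
priors = [Fraction(9, 10), Fraction(1, 10)]
# P(A | B_k), where A = "the ball drawn is white"
likelihoods = [Fraction(2, 4), Fraction(5, 6)]

def bayes(priors, likelihoods, i):
    """Posterior P(B_i | A), with P(A) computed from the total probability formula (3.6)."""
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / total

print(bayes(priors, likelihoods, 1))
```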


15. One urn contains only white balls, while another urn contains 30 white 
balls and 10 black balls. An urn is selected at random, and then a ball is drawn 
(at random) from the urn. The ball turns out to be white, and is then put back 
into the urn. What is the probability that another ball drawn from the same urn 
will be black? 


Ans. $\frac{3}{28}$.


16. Two balls are drawn from an urn containing $n$ balls numbered from 1 to $n$.
The first ball is kept if it is numbered 1, and returned to the urn otherwise. What
is the probability of the second ball being numbered 2?

Ans. $\dfrac{n^2 - n + 1}{n^2 (n - 1)}.$


17. A regular tetrahedron is made into an unbiased die, by labelling the four 
faces a, b, c and abc, respectively. Let A be the event that the die falls on either 
of the two faces bearing the letter a, B the event that it falls on either of the two 
faces bearing the letter b, and C the event that it falls on either of the two faces 
bearing the letter c. Prove that the events A, B and C are “pairwise independent”
but not independent.


Ans. 


18. An urn contains w white balls, b black balls and r red balls. Find the 
probability of a white ball being drawn before a black ball if 


a) Each ball is replaced after being drawn; 
b) No balls are replaced. 


Ans. $\dfrac{w}{w+b}$ in both cases.


5 As defined in Example 3, p. 33.

RANDOM VARIABLES


7. Discrete and Continuous Random Variables. 
Distribution Functions 


Given a sample space $\Omega$, by a random variable we mean a numerical
function $\xi = \xi(\omega)$ whose value depends on the elementary events $\omega \in \Omega$.
Let $P\{x' \le \xi \le x''\}$ be the probability of the event $\{x' \le \xi \le x''\}$, i.e., the
probability that $\xi$ takes a value in the interval $x' \le x \le x''$. Then knowledge
of $P\{x' \le \xi \le x''\}$ for all $x'$ and $x''$ ($x' \le x''$) is said to specify the
probability distribution of the random variable $\xi$.

A random variable $\xi = \xi(\omega)$ is said to be discrete (or to have a discrete
distribution) if $\xi$ takes only a finite or countably infinite number of distinct
values $x$, with corresponding probabilities

$$P_\xi(x) = P\{\xi = x\}, \qquad \sum_x P_\xi(x) = 1,$$

where the summation is over all the values of $x$ taken by $\xi$. For such random
variables,

$$P\{x' \le \xi \le x''\} = \sum_{x' \le x \le x''} P_\xi(x), \tag{4.1}$$

where the summation is over the finite or countably infinite number of values
of $x$ which $\xi$ can take in the interval $x' \le x \le x''$.

A random variable $\xi = \xi(\omega)$ is said to be continuous (or to have a
continuous distribution) if

$$P\{x' \le \xi \le x''\} = \int_{x'}^{x''} p_\xi(x)\,dx, \tag{4.2}$$

where $p_\xi(x)$ is a nonnegative integrable function, called the probability
density of the random variable $\xi$, with unit integral

$$\int_{-\infty}^{\infty} p_\xi(x)\,dx = 1.$$

Clearly, if $\xi$ is a continuous random variable, then

$$P\{\xi = x\} = 0$$

for any given value $x$, while¹

$$P\{\xi \in dx\} \sim p_\xi(x)\,dx$$

for every $x$ with a neighborhood in which the probability density $p_\xi(x)$ is
continuous. Here $P\{\xi \in dx\}$ is the probability of the event $\{\xi \in dx\}$,
consisting of $\xi$ taking any value in an infinitesimal interval $dx$ centered at the
point $x$.
The function

$$\Phi_\xi(x) = P\{\xi \le x\}, \qquad -\infty < x < \infty$$

is called the distribution function of the random variable $\xi$. If $\xi$ is a discrete
random variable, $\Phi_\xi(x)$ is the step function

$$\Phi_\xi(x) = \sum_{x' \le x} P_\xi(x'),$$

taking a finite or countably infinite number of distinct values [the graph of
such a function is shown in Figure 4(a)]. If $\xi$ is a continuous random
variable, $\Phi_\xi(x)$ is the continuous function

$$\Phi_\xi(x) = \int_{-\infty}^{x} p_\xi(x')\,dx' \tag{4.3}$$

[the graph of such a function is shown in Figure 4(b)].² Clearly,

$$P\{x' < \xi \le x''\} = \Phi_\xi(x'') - \Phi_\xi(x') \tag{4.4}$$

for any random variable $\xi$.

Now consider two random variables $\xi_1$ and $\xi_2$, or equivalently, the random
point or vector $\xi = (\xi_1, \xi_2)$. First suppose $\xi_1$ and $\xi_2$ are discrete. Then
$\xi_1$ and $\xi_2$ have a joint probability distribution, characterized by the
probabilities

$$P_{\xi_1,\xi_2}(x_1, x_2) = P\{\xi_1 = x_1, \xi_2 = x_2\}, \tag{4.5}$$

where $x_1$ and $x_2$ range over all possible values of the corresponding random


1 The symbol $\in$ means “belongs to” or “is contained in.”

2 By a well-known theorem on differentiation, $d\Phi_\xi(x)/dx = p_\xi(x)$ almost everywhere.
See e.g., E. C. Titchmarsh, The Theory of Functions, second edition, Oxford University
Press, London (1939), p. 362.

FIGURE 4. (a) A typical distribution function of a discrete random
variable taking only the integral values $\ldots, -2, -1, 0, 1, 2, \ldots$
At the points $x = \ldots, -2, -1, 0, 1, 2, \ldots$, $\Phi_\xi(x)$ has jumps
equal to the corresponding probabilities $P_\xi(x)$. (b) A typical
distribution function of a continuous random variable. Any
continuous monotonic function $\Phi_\xi(x)$ such that
$\lim_{x \to -\infty} \Phi_\xi(x) = 0$, $\lim_{x \to \infty} \Phi_\xi(x) = 1$
can serve as the distribution function of a continuous random variable $\xi$.³


variables $\xi_1$ and $\xi_2$. The probability of any event of the type $\{(\xi_1, \xi_2) \in B\}$,
i.e., the “probability of the random point $\xi = (\xi_1, \xi_2)$ falling in a given set
$B$,” is given by

$$P\{(\xi_1, \xi_2) \in B\} = \sum_{(x_1, x_2) \in B} P_{\xi_1,\xi_2}(x_1, x_2),$$

where the summation is over all possible values $x_1$, $x_2$ of the random variables
$\xi_1$, $\xi_2$ such that the point $(x_1, x_2)$ lies in $B$. Next suppose $\xi_1$ and $\xi_2$ are
continuous. Then by the joint probability density of $\xi_1$ and $\xi_2$ we mean a


3 It should be noted that there are random variables which are neither discrete nor
continuous but a “mixture of both.” There are also continuous distribution functions more
general than (4.3).

function $p_{\xi_1,\xi_2}(x_1, x_2)$ of two variables $x_1$ and $x_2$ such that the probability of
any event of the type $\{(\xi_1, \xi_2) \in B\}$ is given by

$$P\{(\xi_1, \xi_2) \in B\} = \iint_B p_{\xi_1,\xi_2}(x_1, x_2)\,dx_1\,dx_2 \tag{4.6}$$

(the integral is over $B$).
Given a family of random variables $\xi_1, \ldots, \xi_n$, suppose the events
$\{x'_k \le \xi_k \le x''_k\}$, $k = 1, \ldots, n$ are independent for arbitrary $x'_k$ and $x''_k$
($x'_k \le x''_k$). Then the random variables $\xi_1, \ldots, \xi_n$ are said to be (statistically)
independent. Given an infinite sequence of random variables $\xi_1, \xi_2, \ldots$,
suppose the random variables $\xi_1, \ldots, \xi_n$ are independent for every $n$, or
equivalently that the events $\{x'_k \le \xi_k \le x''_k\}$, $k = 1, 2, \ldots$ are independent
for arbitrary $x'_k$ and $x''_k$. Then $\xi_1, \xi_2, \ldots$ is said to be a sequence of independent
random variables.
Suppose two random variables $\xi_1$ and $\xi_2$ are independent. Then clearly
their joint probability distribution (4.5) is such that

$$P_{\xi_1,\xi_2}(x_1, x_2) = P_{\xi_1}(x_1)P_{\xi_2}(x_2) \tag{4.7}$$

if $\xi_1$ and $\xi_2$ are discrete, and

$$p_{\xi_1,\xi_2}(x_1, x_2) = p_{\xi_1}(x_1)p_{\xi_2}(x_2) \tag{4.7'}$$

if $\xi_1$ and $\xi_2$ are continuous. In (4.7'), $p_{\xi_1}(x_1)$ is the probability density of $\xi_1$
and $p_{\xi_2}(x_2)$ that of $\xi_2$, while $p_{\xi_1,\xi_2}(x_1, x_2)$ is the joint probability density of $\xi_1$
and $\xi_2$ figuring in (4.6).


Example 1 (The uniform distribution). Suppose a point $\xi$ is “tossed at
random” onto the interval $[a, b]$. This means that the probability of $\xi$ falling
in a subinterval $[x', x''] \subset [a, b]$ does not depend on the location of $[x', x'']$.
Hence the probability of $\xi$ falling in $[x', x'']$ is proportional to the length
$x'' - x'$.⁴ More exactly, we have

$$P\{x' \le \xi \le x''\} = \frac{x'' - x'}{b - a} = \int_{x'}^{x''} \frac{dx}{b - a},$$

since then the probability of $\xi$ falling in $[a, b]$ itself is

$$P\{a \le \xi \le b\} = \int_a^b \frac{dx}{b - a} = 1,$$

as it must be. Clearly, $\xi$ is a continuous random variable, with probability


4 Let $f(s)$ be the probability of $\xi$ falling in a subinterval of length $s$. Then clearly
$f(s + t) = f(s) + f(t)$. But it can be shown that any function $f(s)$ satisfying this equation
is either of the form $ks$ ($k$ a constant) or else unbounded in every interval (see W. Feller,
op. cit., p. 459).

density

$$p_\xi(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{if } x < a \text{ or } x > b. \end{cases}$$

Such a random variable is said to have a uniform distribution.


Example 2. Suppose two points $\xi_1$ and $\xi_2$ are tossed at random and
independently onto a line segment of length $L$. What is the probability that
the distance between the two points does not exceed $l$?

Solution. Imagine that $\xi_1$ falls in an interval $[0, L]$ of the $x_1$-axis, while
$\xi_2$ falls in an interval $[0, L]$ of the $x_2$-axis, perpendicular to the $x_1$-axis as in
Figure 5. Then the desired probability is just the probability that a point
$\xi = (\xi_1, \xi_2)$ tossed at random onto the square $0 \le x_1, x_2 \le L$ will fall in the
region $B$ bounded by the lines $x_2 = l + x_1$ and $x_2 = -l + x_1$ ($B$ is the
unshaded region in Figure 5).⁵ By hypothesis, the random variables $\xi_1$ and $\xi_2$
are independent and are both uniformly distributed in $[0, L]$, i.e., both have
probability density

$$p_\xi(x) = \frac{1}{L}, \qquad 0 \le x \le L.$$

FIGURE 5

Hence, by (4.7'), the joint probability density of the independent random
variables $\xi_1$ and $\xi_2$ is just

$$p_{\xi_1,\xi_2}(x_1, x_2) = \frac{1}{L^2}, \qquad 0 \le x_1, x_2 \le L.$$


5 Note that $|\xi_1 - \xi_2|$ is the horizontal distance between the point $(\xi_1, \xi_2)$ and the line
$x_2 = x_1$. This is the distance $\rho$ shown in Figure 5, from which it is apparent that $\rho \le l$ if
and only if $(\xi_1, \xi_2)$ lies in $B$.

Therefore the probability of the random point $\xi = (\xi_1, \xi_2)$ falling in the
region $B$ is given by

$$P\{(\xi_1, \xi_2) \in B\} = \iint_B \frac{dx_1\,dx_2}{L^2} = \frac{2Ll - l^2}{L^2},$$

since $L^2 - 2 \cdot \tfrac{1}{2}(L - l)^2 = 2Ll - l^2$ is the area of $B$ (the square minus the
two shaded triangles).
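The result of Example 2 can be compared with a simulation. The sketch below is an illustration only: the function names, the number of trials and the fixed seed are assumptions, and Python's `random.Random.uniform` stands in for "tossing a point at random."

```python
import random

def exact_probability(L, l):
    """Area of B divided by the area of the square, for 0 <= l <= L."""
    return (2 * L * l - l * l) / (L * L)

def simulated_probability(L, l, trials=200_000, seed=0):
    """Relative frequency of |xi1 - xi2| <= l for two independent uniform points on [0, L]."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    hits = 0
    for _ in range(trials):
        xi1 = rng.uniform(0.0, L)
        xi2 = rng.uniform(0.0, L)
        if abs(xi1 - xi2) <= l:
            hits += 1
    return hits / trials

print(exact_probability(1.0, 0.5))      # 0.75
print(simulated_probability(1.0, 0.5))  # a relative frequency near 0.75
```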


Example 3 (Buffon’s needle problem). Suppose a needle is tossed at random
onto a plane ruled with parallel lines a distance $L$ apart, where by a “needle”
we mean a line segment of length $l \le L$. What is the probability of the needle
intersecting one of the parallel lines?

Solution. Let $\xi_1$ be the angle between the needle and the direction of the
rulings, and let $\xi_2$ be the distance between the bottom point of the needle
and the nearest line above this point [see Figure 6(a)]. Then the conditions
of the “needle tossing experiment” are such that the random variable $\xi_1$ is
uniformly distributed in the interval $[0, \pi]$, while the random variable $\xi_2$ is
uniformly distributed in the interval $[0, L]$. Hence, assuming that the random
variables $\xi_1$ and $\xi_2$ are independent, we find that their joint probability density
is

$$p_{\xi_1,\xi_2}(x_1, x_2) = \frac{1}{\pi L}, \qquad 0 \le x_1 \le \pi, \quad 0 \le x_2 \le L.$$

FIGURE 6

The event consisting of the needle intersecting one of the rulings occurs if
and only if

$$\xi_2 \le l \sin \xi_1,$$

i.e., if and only if the corresponding point $\xi = (\xi_1, \xi_2)$ falls in the region $B$,
where $B$ is the part of the rectangle $0 \le x_1 \le \pi$, $0 \le x_2 \le L$ lying between
the $x_1$-axis and the curve $x_2 = l \sin x_1$ [$B$ is the unshaded region in Figure
6(b)]. Hence, by the general formula (4.6),

$$P\{(\xi_1, \xi_2) \in B\} = \iint_B \frac{dx_1\,dx_2}{\pi L} = \frac{2l}{\pi L}, \tag{4.8}$$

where

$$\int_0^{\pi} l \sin x_1\,dx_1 = 2l$$

is the area of $B$.

In deducing (4.8), we have assumed that $\xi_1$ and $\xi_2$ are independent
random variables. This assumption can be tested experimentally. In fact,
according to (4.8), if the needle is repeatedly tossed onto the ruled plane,
then the frequency of the event $A$, consisting of the needle intersecting one of
the rulings, must be approximately $2l/(\pi L)$. Suppose the needle is tossed $n$
times, and let $n(A)$ be the number of times $A$ occurs, so that $n(A)/n$ is the
relative frequency of the event $A$. Then

$$\frac{n(A)}{n} \sim \frac{2l}{\pi L}$$

for large $n$, as discussed on p. 3. Hence

$$\frac{2l}{L} \cdot \frac{n}{n(A)}$$

should be a good approximation to $\pi = 3.14\ldots$ for large $n$. This actually
turns out to be the case.⁶
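The needle-tossing experiment can be imitated numerically. The following sketch is not the physical experiment: it samples the angle and the distance directly from the uniform distributions assumed in the solution (and so already uses the value of `math.pi` to draw the angle). The function name, seed and number of tosses are illustrative assumptions.

```python
import math
import random

def buffon_pi_estimate(l, L, n, seed=0):
    """Estimate pi as 2*l*n / (L * n(A)) from n simulated tosses of a needle of length l <= L."""
    if not 0 < l <= L:
        raise ValueError("the needle length l must satisfy 0 < l <= L")
    rng = random.Random(seed)
    crossings = 0  # n(A): number of tosses in which the needle meets a ruling
    for _ in range(n):
        angle = rng.uniform(0.0, math.pi)  # xi_1, uniform on [0, pi]
        distance = rng.uniform(0.0, L)     # xi_2, uniform on [0, L]
        if distance <= l * math.sin(angle):
            crossings += 1
    if crossings == 0:
        raise ValueError("no crossings observed; increase n")
    return 2 * l * n / (L * crossings)

print(buffon_pi_estimate(l=1.0, L=2.0, n=200_000))  # an approximation to pi
```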


Example 4. Given two independent random variables $\xi_1$ and $\xi_2$, with
probability densities $p_{\xi_1}(x_1)$ and $p_{\xi_2}(x_2)$, find the probability density of the
random variable

$$\eta = \xi_1 + \xi_2.$$

Solution. By (4.7'), the joint probability density of $\xi_1$ and $\xi_2$ equals
$p_{\xi_1}(x_1)p_{\xi_2}(x_2)$, and hence, by (4.6),

$$P\{y' \le \eta \le y''\} = \iint_{y' \le x_1 + x_2 \le y''} p_{\xi_1}(x_1)p_{\xi_2}(x_2)\,dx_1\,dx_2
= \int_{y'}^{y''} \left[\int_{-\infty}^{\infty} p_{\xi_1}(y - x)p_{\xi_2}(x)\,dx\right] dy.$$

Therefore the probability density of the random variable $\eta$ is given by the
expression

$$p_\eta(y) = \int_{-\infty}^{\infty} p_{\xi_1}(y - x)p_{\xi_2}(x)\,dx,$$

called the composition or convolution of the functions $p_{\xi_1}$ and $p_{\xi_2}$.


6 See J. V. Uspensky, Introduction to Mathematical Probability, McGraw-Hill Book
Co., Inc., New York (1937), p. 113.

For example, suppose $\xi_1$ and $\xi_2$ are both uniformly distributed in the
interval $[0, 1]$, so that they both have the probability density

$$p_\xi(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1, \\ 0 & \text{if } x < 0 \text{ or } x > 1. \end{cases}$$

Then

$$p_\eta(y) = \begin{cases} \displaystyle\int_0^y dx = y & \text{if } 0 \le y \le 1, \\[2mm] \displaystyle\int_{y-1}^1 dx = 2 - y & \text{if } 1 \le y \le 2, \\[2mm] 0 & \text{if } y < 0 \text{ or } y > 2. \end{cases}$$

The graph of the density $p_\eta(y)$ is triangular in shape, as shown in Figure 7.

FIGURE 7
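The triangular density can be checked by evaluating the convolution integral numerically. This is a rough sketch under stated assumptions: the midpoint rule, the integration range $[-1, 3]$ and the step count are arbitrary choices, and all function names are illustrative.

```python
def uniform_density(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolution(p1, p2, y, lo=-1.0, hi=3.0, steps=4000):
    """Midpoint-rule approximation to the integral of p1(y - x) * p2(x) dx over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += p1(y - x) * p2(x)
    return total * h

def triangular_density(y):
    """Closed form of the density of the sum of two independent uniform variables on [0, 1]."""
    if 0.0 <= y <= 1.0:
        return y
    if 1.0 < y <= 2.0:
        return 2.0 - y
    return 0.0

for y in (0.25, 0.5, 1.0, 1.5, 1.9):
    # The two columns should agree to within the discretization error.
    print(y, convolution(uniform_density, uniform_density, y), triangular_density(y))
```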


8. Mathematical Expectation 


By the mathematical expectation or mean value of a discrete random
variable $\xi$, denoted by $E\xi$, we mean the quantity

$$E\xi = \sum_x x P_\xi(x), \tag{4.9}$$

provided that the series converges absolutely.⁷ Here the summation has the
same meaning as on p. 37, and, as usual, $P_\xi(x) = P\{\xi = x\}$. Given a
discrete random variable $\xi$, consider the new random variable $\eta = \varphi(\xi)$,
where $\varphi(x)$ is some function of $x$. Then the mathematical expectation of $\eta$ is
given in terms of the probability distribution of $\xi$ by the formula

$$E\eta = E\varphi(\xi) = \sum_x \varphi(x) P_\xi(x). \tag{4.10}$$


7 I.e., provided that $\sum_x |x| P_\xi(x) < \infty$.

In fact, $\eta$ is a discrete random variable taking only the values $y = \varphi(x)$,
where $x$ ranges over all possible values of the random variable $\xi$. Therefore⁸

$$P\{\eta = y\} = \sum_{x:\, \varphi(x) = y} P_\xi(x),$$

where the summation is over all $x$ such that $\varphi(x) = y$, and hence

$$E\eta = \sum_y y\,P\{\eta = y\} = \sum_y y \sum_{x:\, \varphi(x) = y} P_\xi(x) = \sum_x \varphi(x) P_\xi(x),$$

as asserted.

More generally, let $\varphi(\xi_1, \xi_2)$ be a random variable which is a function of
two random variables $\xi_1$ and $\xi_2$ with joint probability distribution
$P_{\xi_1,\xi_2}(x_1, x_2)$. Then it is easily verified that $\varphi(\xi_1, \xi_2)$ has the mathematical
expectation

$$E\varphi(\xi_1, \xi_2) = \sum_{x_1} \sum_{x_2} \varphi(x_1, x_2) P_{\xi_1,\xi_2}(x_1, x_2). \tag{4.11}$$

It is clear from (4.9) that

a) $E1 = 1$;
b) $E(c\xi) = cE\xi$ for an arbitrary constant $c$;
c) $|E\xi| \le E|\xi|$. (4.12)

Moreover, it follows from (4.11) that

d) $E(\xi_1 + \xi_2) = E\xi_1 + E\xi_2$ for arbitrary random variables $\xi_1$ and $\xi_2$
with mathematical expectations $E\xi_1$ and $E\xi_2$;
e) $\xi \ge 0$ implies $E\xi \ge 0$, and more generally
$\xi_1 \le \xi_2$ implies $E\xi_1 \le E\xi_2$; (4.13)
f) If $\xi_1$ and $\xi_2$ are independent random variables, then
$E\xi_1\xi_2 = E\xi_1 E\xi_2$. (4.14)


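For example, to prove (4.14), we write $\varphi(\xi_1, \xi_2) = \xi_1\xi_2$. Then, for
independent $\xi_1$ and $\xi_2$, (4.11) implies

$$E(\xi_1\xi_2) = \sum_{x_1} \sum_{x_2} x_1 x_2 P_{\xi_1}(x_1) P_{\xi_2}(x_2)
= \sum_{x_1} x_1 P_{\xi_1}(x_1) \sum_{x_2} x_2 P_{\xi_2}(x_2) = E\xi_1 E\xi_2.$$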
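Formulas (4.9)-(4.11) and properties d) and f) can be verified by direct summation for a small example. The sketch below uses two independent unbiased dice; the choice of example and the helper names `expectation` and `joint_expectation` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def expectation(dist, phi=lambda x: x):
    """E phi(xi) as the sum of phi(x) * P(x) over all values x, as in (4.9) and (4.10)."""
    return sum(phi(x) * p for x, p in dist.items())

def joint_expectation(joint, phi):
    """E phi(xi1, xi2) computed from a joint distribution, as in (4.11)."""
    return sum(phi(x1, x2) * p for (x1, x2), p in joint.items())

# Distribution of one unbiased die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

# Joint distribution of two independent dice, built from the product rule (4.7).
joint = {(x1, x2): die[x1] * die[x2] for x1, x2 in product(die, die)}

mean = expectation(die)                                 # 7/2
e_sum = joint_expectation(joint, lambda a, b: a + b)    # property d): 7/2 + 7/2
e_prod = joint_expectation(joint, lambda a, b: a * b)   # property f): (7/2) * (7/2)

print(mean, e_sum, e_prod)  # 7/2 7 49/4
```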


To define the mathematical expectation of a continuous random variable
$\xi$, we first approximate $\xi$ by a sequence of discrete random variables $\xi_n$,


8 The colon should be read as “such that.”

$n = 1, 2, \ldots$ Let $\varepsilon_n$, $n = 1, 2, \ldots$ be a sequence of positive numbers
converging to zero. Then for each $n = 1, 2, \ldots$, let

$$\ldots, x_{-2,n}, x_{-1,n}, x_{0,n}, x_{1,n}, x_{2,n}, \ldots \tag{4.15}$$

be an infinite set of distinct points such that⁹

$$\sup_k |x_{k,n} - x_{k-1,n}| = \varepsilon_n, \tag{4.16}$$

and let $\xi_n$ be a discrete random variable such that

$$\xi_n = x_{k,n} \quad \text{if} \quad x_{k-1,n} < \xi \le x_{k,n}.$$

It follows that

$$|\xi_n - \xi| \le \varepsilon_n,$$

and hence

$$|\xi_m - \xi_n| \le |\xi_m - \xi| + |\xi - \xi_n| \le \varepsilon_m + \varepsilon_n \to 0$$

as $m, n \to \infty$. Therefore, by (4.12) and (4.13),

$$|E\xi_m - E\xi_n| = |E(\xi_m - \xi_n)| \le E|\xi_m - \xi_n| \le \varepsilon_m + \varepsilon_n \to 0$$

as $m, n \to \infty$ (provided $E\xi_n$ exists for all $n$). But then

$$\lim_{n \to \infty} E\xi_n$$

exists, by the Cauchy convergence criterion. This limit is called the
mathematical expectation or mean value of the continuous random variable $\xi$,
again denoted by $E\xi$. Clearly,

$$E\xi = \lim_{n \to \infty} \sum_{k=-\infty}^{\infty} x_{k,n} P\{x_{k-1,n} < \xi \le x_{k,n}\}.$$

Suppose $\xi$ has the probability density $p_\xi(x)$. Then, choosing the points
(4.15) to be continuity points of $p_\xi(x)$, we have

$$\sum_{k=-\infty}^{\infty} x_{k,n} P\{x_{k-1,n} < \xi \le x_{k,n}\}
= \sum_{k=-\infty}^{\infty} x_{k,n} \int_{x_{k-1,n}}^{x_{k,n}} p_\xi(x)\,dx
\sim \sum_{k=-\infty}^{\infty} x_{k,n}\, p_\xi(x_{k,n})(x_{k,n} - x_{k-1,n}),$$

and hence

$$E\xi = \int_{-\infty}^{\infty} x p_\xi(x)\,dx \tag{4.17}$$

[compare this with (4.9)]. For continuous random variables of the form


9 The symbol sup denotes the supremum or least upper bound. Therefore the left-hand
side of (4.16) is the least upper bound of all the differences $|x_{k,n} - x_{k-1,n}|$,
$k = \ldots, -2, -1, 0, 1, 2, \ldots$ Thus (4.16) means that no two adjacent points of (4.15) are
more than $\varepsilon_n$ apart. Note also that any closed interval of length $\varepsilon_n$ contains at
least one of the points (4.15).

$\eta = \varphi(\xi)$ and $\eta = \varphi(\xi_1, \xi_2)$, we have

$$E\varphi(\xi) = \int_{-\infty}^{\infty} \varphi(x) p_\xi(x)\,dx \tag{4.18}$$

and

$$E\varphi(\xi_1, \xi_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \varphi(x_1, x_2)\, p_{\xi_1,\xi_2}(x_1, x_2)\,dx_1\,dx_2, \tag{4.19}$$

by analogy with (4.10) and (4.11), where $p_{\xi_1,\xi_2}(x_1, x_2)$ is the joint probability
density of the random variables $\xi_1$ and $\xi_2$. It is easily verified that properties
a)-f) of the mathematical expectation continue to hold for continuous
random variables (the details are left as an exercise).


Remark. Other synonyms for the mathematical expectation of a random
variable $\xi$, discrete or continuous, are the expected value of $\xi$ or the average
(value) of $\xi$. The mathematical expectation and mean value are often simply
called the “expectation” and “mean,” respectively. 


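Example. Let $\xi$ be a random variable uniformly distributed in the interval
$[a, b]$, i.e., let $\xi$ have the probability density

$$p_\xi(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b, \\ 0 & \text{if } x < a \text{ or } x > b. \end{cases}$$

Then the mathematical expectation of $\xi$ is

$$E\xi = \int_a^b \frac{x\,dx}{b - a} = \frac{a + b}{2}.$$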

A random variable of the form $\eta = \xi_1 + i\xi_2$ involving two real random
variables $\xi_1$ and $\xi_2$ (the real and imaginary parts of $\eta$) is called a complex
random variable. The mathematical expectation of $\eta = \xi_1 + i\xi_2$ is defined as

$$E\eta = E\xi_1 + iE\xi_2.$$

It is easy to see that formulas (4.10) and (4.18) remain valid for the case
where $\varphi(\xi)$ is a complex-valued function of a real random variable $\xi$, and
that (4.11) and (4.19) remain valid for the case where $\varphi(\xi_1, \xi_2)$ is a
complex-valued function of two real random variables $\xi_1$ and $\xi_2$. In particular, let
$\varphi_1(\xi_1)$ and $\varphi_2(\xi_2)$ be complex-valued functions of two independent real
random variables $\xi_1$ and $\xi_2$. Then, choosing $\varphi(\xi_1, \xi_2) = \varphi_1(\xi_1)\varphi_2(\xi_2)$ in
(4.11) or (4.19), we deduce the formula

$$E[\varphi_1(\xi_1)\varphi_2(\xi_2)] = E\varphi_1(\xi_1)\,E\varphi_2(\xi_2), \tag{4.20}$$

which generalizes (4.14).




9. Chebyshev’s Inequality. The Variance and 
Correlation Coefficient 


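By the mean square value of a (real) random variable $\xi$ is meant the
quantity $E\xi^2$, equal to

$$E\xi^2 = \sum_x x^2 P_\xi(x)$$

if $\xi$ is discrete, or

$$E\xi^2 = \int_{-\infty}^{\infty} x^2 p_\xi(x)\,dx$$

if $\xi$ is continuous.¹⁰ Given any random variable $\xi$ and any number $\varepsilon > 0$,
let

$$\xi_\varepsilon = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon, \\ \varepsilon & \text{if } |\xi| > \varepsilon. \end{cases}$$

Then obviously $\xi_\varepsilon^2 \le \xi^2$, and hence, by (4.13),

$$E\xi_\varepsilon^2 \le E\xi^2,$$

or equivalently

$$\varepsilon^2 P\{|\xi| > \varepsilon\} \le E\xi^2,$$

since clearly

$$E\xi_\varepsilon^2 = \varepsilon^2 P\{|\xi| > \varepsilon\}.$$

It follows that

$$P\{|\xi| > \varepsilon\} \le \frac{1}{\varepsilon^2} E\xi^2, \tag{4.21}$$

a result known as Chebyshev’s inequality. According to (4.21), if $E\xi^2/\varepsilon^2 \le \delta$,
then $P\{|\xi| > \varepsilon\} \le \delta$, and hence $P\{|\xi| \le \varepsilon\} \ge 1 - \delta$. Therefore, if $\delta$ is
small, it is highly probable that $|\xi| \le \varepsilon$. In particular, if $E\xi^2 = 0$, then
$P\{|\xi| > \varepsilon\} = 0$ for every $\varepsilon > 0$, and hence $\xi = 0$ with probability 1.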
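To see how loose the bound (4.21) can be, one can compare both sides exactly for a simple discrete variable. The sketch below takes $\xi$ to be the deviation of an unbiased die from its mean $7/2$; this example and the function name `chebyshev_check` are illustrative assumptions.

```python
from fractions import Fraction

def chebyshev_check(values, probs, eps):
    """Return (P{|xi| > eps}, E xi^2 / eps^2) for a discrete random variable xi."""
    tail = sum(p for x, p in zip(values, probs) if abs(x) > eps)
    bound = sum(x * x * p for x, p in zip(values, probs)) / (eps * eps)
    return tail, bound

# xi = (number of spots on an unbiased die) - 7/2, so that E xi = 0.
values = [Fraction(k) - Fraction(7, 2) for k in range(1, 7)]
probs = [Fraction(1, 6)] * 6

tail, bound = chebyshev_check(values, probs, Fraction(2))
print(tail, bound)  # 1/3 35/48
```

Here the exact probability $1/3$ is well below the Chebyshev bound $35/48$, which illustrates that (4.21) holds but need not be sharp.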

By the variance or dispersion of a random variable $\xi$, denoted by $D\xi$,
we mean the mean square value $E(\xi - a)^2$ of the difference $\xi - a$, where
$a = E\xi$ is the mean value of $\xi$. It follows from

$$E(\xi - a)^2 = E\xi^2 - 2aE\xi + a^2 = E\xi^2 - 2a^2 + a^2$$

that

$$D\xi = E\xi^2 - a^2.$$

Obviously

$$D1 = 0,$$

and

$$D(c\xi) = c^2 D\xi$$

for an arbitrary constant $c$.


10 It is assumed, of course, that $E\xi^2$ exists. This is not always the case (see e.g., Problem
24, p. 53).

If $\xi_1$ and $\xi_2$ are independent random variables, then

$$D(\xi_1 + \xi_2) = D\xi_1 + D\xi_2.$$

In fact, if $a_1 = E\xi_1$ and $a_2 = E\xi_2$, then, by (4.14),

$$E(\xi_1 - a_1)(\xi_2 - a_2) = E(\xi_1 - a_1)E(\xi_2 - a_2) = 0, \tag{4.22}$$

and hence

$$\begin{aligned}
D(\xi_1 + \xi_2) &= E(\xi_1 + \xi_2 - a_1 - a_2)^2 \\
&= E(\xi_1 - a_1)^2 + 2E(\xi_1 - a_1)(\xi_2 - a_2) + E(\xi_2 - a_2)^2 \\
&= E(\xi_1 - a_1)^2 + E(\xi_2 - a_2)^2 = D\xi_1 + D\xi_2.
\end{aligned}$$
Given two random variables $\xi_1$ and $\xi_2$, we now consider the problem of
finding the linear expression of the form $\hat{c}_1 + \hat{c}_2\xi_2$, involving constants $\hat{c}_1$
and $\hat{c}_2$, such that $\hat{c}_1 + \hat{c}_2\xi_2$ is the best “mean square approximation” to $\xi_1$,
in the sense that

$$E(\xi_1 - \hat{c}_1 - \hat{c}_2\xi_2)^2 = \min_{c_1, c_2} E(\xi_1 - c_1 - c_2\xi_2)^2, \tag{4.23}$$

where the minimum is taken with respect to all $c_1$ and $c_2$. To solve this
problem, we let

$$a_1 = E\xi_1, \quad \sigma_1^2 = D\xi_1, \qquad a_2 = E\xi_2, \quad \sigma_2^2 = D\xi_2, \tag{4.24}$$

and introduce the quantity

$$r = \frac{E(\xi_1 - a_1)(\xi_2 - a_2)}{\sigma_1\sigma_2}, \tag{4.25}$$

called the correlation coefficient of the random variables $\xi_1$ and $\xi_2$. Going
over for convenience to the “normalized” random variables

$$\eta_1 = \frac{\xi_1 - a_1}{\sigma_1}, \qquad \eta_2 = \frac{\xi_2 - a_2}{\sigma_2},$$

we find that

$$\min_{c_1, c_2} E(\xi_1 - c_1 - c_2\xi_2)^2 = \sigma_1^2 \min_{c_1, c_2} E(\eta_1 - c_1 - c_2\eta_2)^2 \tag{4.26}$$

(why?). Clearly,

$$E\eta_1 = E\eta_2 = 0, \quad D\eta_1 = E\eta_1^2 = 1, \quad D\eta_2 = E\eta_2^2 = 1, \quad E\eta_1\eta_2 = r,$$

$$E(\eta_1 - r\eta_2)\eta_2 = E\eta_1\eta_2 - rE\eta_2^2 = 0,$$

$$E(\eta_1 - r\eta_2)^2 = E\eta_1^2 - 2rE\eta_1\eta_2 + r^2E\eta_2^2 = 1 - r^2,$$

and hence

$$E(\eta_1 - c_1 - c_2\eta_2)^2 = E[(\eta_1 - r\eta_2) - c_1 + (r - c_2)\eta_2]^2 = (1 - r^2) + c_1^2 + (r - c_2)^2$$

for arbitrary $c_1$ and $c_2$. It follows that the minimum of $E(\eta_1 - c_1 - c_2\eta_2)^2$
is achieved for $c_1 = 0$, $c_2 = r$, and is equal to

$$\min_{c_1, c_2} E(\eta_1 - c_1 - c_2\eta_2)^2 = 1 - r^2. \tag{4.27}$$

But

$$\eta_1 - r\eta_2 = \frac{1}{\sigma_1}\left[\xi_1 - a_1 - r\frac{\sigma_1}{\sigma_2}(\xi_2 - a_2)\right]$$

in terms of the original random variables $\xi_1$ and $\xi_2$. Therefore

$$\hat{c}_1 + \hat{c}_2\xi_2 = a_1 + r\frac{\sigma_1}{\sigma_2}(\xi_2 - a_2)$$

is the minimizing linear expression figuring in (4.23), where $a_1$, $a_2$, $\sigma_1^2$, $\sigma_2^2$ and
$r$ are defined by (4.24) and (4.25).

If $\xi_1$ and $\xi_2$ are independent, then, by (4.22),

$$r = \frac{E(\xi_1 - a_1)(\xi_2 - a_2)}{\sigma_1\sigma_2} = \frac{E(\xi_1 - a_1)E(\xi_2 - a_2)}{\sigma_1\sigma_2} = 0.$$

It is clear from (4.27) that $r$ lies in the interval $-1 \le r \le 1$. Moreover, if
$r = \pm 1$, then the random variable $\xi_1$ is simply a linear expression of the
form

$$\xi_1 = \hat{c}_1 + \hat{c}_2\xi_2.$$

In fact, if $r = \pm 1$, then, by (4.26) and (4.27), the mean square value of
$\xi_1 - \hat{c}_1 - \hat{c}_2\xi_2$ is just

$$E(\xi_1 - \hat{c}_1 - \hat{c}_2\xi_2)^2 = \sigma_1^2(1 - r^2) = 0,$$

and hence $\xi_1 - \hat{c}_1 - \hat{c}_2\xi_2 = 0$ with probability 1 (why?).

The above considerations seem to suggest the use of $r$ as a measure of the
extent to which the random variables $\xi_1$ and $\xi_2$ are dependent. However,
although suitable in some situations (see Problem 15, p. 67), this use of $r$
is not justified in general (see Problem 19, p. 53).¹¹
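The quantities in (4.24) and (4.25) and the minimizing coefficients can be computed directly for a finite joint distribution. The sketch below is illustrative only: the function name `best_linear_fit` and the two test distributions are assumptions. The first distribution is the dependent pair of Problem 19, with $\xi_1$ taking the values $-2, -1, 1, 2$ with probability $1/4$ each and $\xi_2 = \xi_1^2$; the second is an exactly linear pair.

```python
import math
from fractions import Fraction

def best_linear_fit(points):
    """points: list of ((x1, x2), probability).

    Returns (c1, c2, r), where c1 + c2*xi2 is the best mean square
    approximation to xi1 and r is the correlation coefficient (4.25).
    """
    a1 = sum(p * x1 for (x1, x2), p in points)
    a2 = sum(p * x2 for (x1, x2), p in points)
    var1 = sum(p * (x1 - a1) ** 2 for (x1, x2), p in points)
    var2 = sum(p * (x2 - a2) ** 2 for (x1, x2), p in points)
    cov = sum(p * (x1 - a1) * (x2 - a2) for (x1, x2), p in points)
    r = float(cov) / math.sqrt(float(var1) * float(var2))
    c2 = cov / var2      # equals r * sigma1 / sigma2
    c1 = a1 - c2 * a2    # so that c1 + c2*xi2 = a1 + c2*(xi2 - a2)
    return c1, c2, r

quarter = Fraction(1, 4)
dependent = [((x, x * x), quarter) for x in (-2, -1, 1, 2)]   # xi2 = xi1**2
linear = [((3 - 2 * x, x), quarter) for x in (-2, -1, 1, 2)]  # xi1 = 3 - 2*xi2

print(best_linear_fit(dependent))  # r is 0 although the variables are dependent
print(best_linear_fit(linear))     # r is -1 and the fit recovers c1 = 3, c2 = -2
```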


PROBLEMS 


1. A motorist encounters four consecutive traffic lights, each equally likely to
be red or green. Let $\xi$ be the number of green lights passed by the motorist
before being stopped by a red light. What is the probability distribution of $\xi$?


2. Give an example of two distinct random variables with the same distribution 
function. 


3. Find the distribution function of the uniformly distributed random variable
$\xi$ considered in Example 1, p. 40.


11 See also W. Feller, op. cit., p. 236.

4. A random variable $\xi$ has probability density

$$p_\xi(x) = \frac{a}{1 + x^2} \qquad (-\infty < x < \infty).$$

Find

a) The constant $a$; b) The distribution function of $\xi$;
c) The probability $P\{-1 \le \xi \le 1\}$.

Ans. a) $\dfrac{1}{\pi}$; b) $\dfrac{1}{2} + \dfrac{1}{\pi}\arctan x$; c) $\dfrac{1}{2}$.


5. A random variable $\xi$ has probability density

$$p_\xi(x) = \begin{cases} a x^2 e^{-kx} & \text{if } 0 \le x < \infty, \\ 0 & \text{otherwise} \end{cases}$$

($k > 0$). Find

a) The constant $a$; b) The distribution function of $\xi$;
c) The probability $P\{0 \le \xi \le 1/k\}$.


6. A random variable $\xi$ has distribution function

$$\Phi_\xi(x) = a + b \arctan \frac{x}{2} \qquad (-\infty < x < \infty).$$

Find

a) The constants $a$ and $b$; b) The probability density of $\xi$.


7. Two nonoverlapping circular disks of radius r are painted on a circular table 
of radius R. A point is then “tossed at random”’ onto the table. What is the 
probability of the point falling in one of the disks? 


Ans. 2(r/R)?. 


8. What is the probability that two randomly chosen numbers between 0 and 1
will have a sum no greater than 1 and a product no greater than $\tfrac{2}{9}$?

Ans. $\dfrac{1}{3} + \dfrac{2}{9}\displaystyle\int_{1/3}^{2/3} \dfrac{dx}{x} = \dfrac{1}{3} + \dfrac{2}{9}\ln 2 \approx 0.49$.


9. Given two independent random variables $\xi_1$ and $\xi_2$, with probability densities

$$p_{\xi_1}(x) = \begin{cases} \tfrac{1}{2}e^{-x/2} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases}
\qquad
p_{\xi_2}(x) = \begin{cases} \tfrac{1}{3}e^{-x/3} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases}$$

find the probability density of the random variable $\eta = \xi_1 + \xi_2$.

Ans. $p_\eta(x) = \begin{cases} e^{-x/3}(1 - e^{-x/6}) & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$

10. Given three independent random variables $\xi_1$, $\xi_2$ and $\xi_3$, each uniformly
distributed in the interval $[0, 1]$, find the probability density of the random
variable $\xi_1 + \xi_2 + \xi_3$.

Hint. The probability density of $\xi_1 + \xi_2$ (say) was found in Example 4,
p. 43.


11. A random variable $\xi$ takes the values $1, 2, \ldots, n, \ldots$ with probabilities

$$\frac{1}{2}, \frac{1}{2^2}, \ldots, \frac{1}{2^n}, \ldots$$

Find $E\xi$.
12. Balls are drawn from an urn containing $w$ white balls and $b$ black balls until
a white ball appears. Find the mean value $m$ and variance $\sigma^2$ of the number of
black balls drawn, assuming that each ball is replaced after being drawn.

Ans. $m = \dfrac{b}{w}$, $\sigma^2 = \dfrac{b(w + b)}{w^2}$.


13. Find the mean and variance of the random variable $\xi$ with probability
density

$$p_\xi(x) = \tfrac{1}{2}e^{-|x|} \qquad (-\infty < x < \infty).$$

Ans. $E\xi = 0$, $D\xi = 2$.

14. Find the mean and variance of the random variable $\xi$ with probability
density

$$p_\xi(x) = \begin{cases} \dfrac{1}{2b} & \text{if } |x - a| \le b, \\ 0 & \text{otherwise.} \end{cases}$$

Ans. $E\xi = a$, $D\xi = b^2/3$.


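15. The distribution function of a random variable $\xi$ is

$$\Phi_\xi(x) = \begin{cases} 0 & \text{if } x \le -1, \\ a + b \arcsin x & \text{if } -1 \le x \le 1, \\ 1 & \text{if } x \ge 1. \end{cases}$$

Find $E\xi$ and $D\xi$.

Hint. First determine $a$ and $b$.

Ans. $E\xi = 0$, $D\xi = \tfrac{1}{2}$.

16. Let $\xi$ be the number of spots obtained in throwing an unbiased die. Find
the mean and variance of $\xi$.

Ans. $E\xi = \tfrac{7}{2}$, $D\xi = \tfrac{35}{12}$.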


17. In the preceding problem, what is the probability $P$ of $\xi$ deviating from
$E\xi$ by more than $\tfrac{5}{2}$? Show that Chebyshev’s inequality gives only a very crude
estimate of $P$.

18. Prove that if $\xi$ is a random variable such that $Ee^{a\xi}$ exists, where $a$ is a
positive constant, then

$$P\{\xi > \varepsilon\} \le \frac{Ee^{a\xi}}{e^{a\varepsilon}}.$$

Hint. Apply Chebyshev’s inequality to the random variable $\eta = e^{a\xi/2}$.

19. Let $\xi$ be a random variable taking each of the values $-2$, $-1$, $1$ and $2$ with
probability $\tfrac{1}{4}$, and let $\eta = \xi^2$. Prove that $\xi$ and $\eta$ (although obviously dependent)
have correlation coefficient 0.


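20. Find the means and variances of two random variables $\xi_1$ and $\xi_2$ with joint
probability density

$$p_{\xi_1,\xi_2}(x_1, x_2) = \begin{cases} \sin x_1 \sin x_2 & \text{if } 0 \le x_1 \le \dfrac{\pi}{2},\ 0 \le x_2 \le \dfrac{\pi}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

What is the correlation coefficient of $\xi_1$ and $\xi_2$?

21. Find the correlation coefficient $r$ of two random variables $\xi_1$ and $\xi_2$ with
joint probability density

$$p_{\xi_1,\xi_2}(x_1, x_2) = \begin{cases} \tfrac{1}{2}\sin(x_1 + x_2) & \text{if } 0 \le x_1 \le \dfrac{\pi}{2},\ 0 \le x_2 \le \dfrac{\pi}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

Ans. $r = \dfrac{\dfrac{\pi}{2} - 1 - \dfrac{\pi^2}{16}}{\dfrac{\pi^2}{16} + \dfrac{\pi}{2} - 2} \approx -0.245$.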
pee 16 


22. Given a random variable $\xi$, let $\varphi(t)$ be a nondecreasing positive function
such that $E\varphi(\xi)$ exists. Prove that

$$P\{\xi > t\} \le \frac{E\varphi(\xi)}{\varphi(t)}. \tag{4.28}$$

23. Deduce Chebyshev’s inequality as a special case of (4.28).

24. Let $\xi$ be a random variable with probability density

$$p_\xi(x) = \frac{1}{\pi(1 + x^2)} \qquad (-\infty < x < \infty).$$

Show that $E\xi$ and $D\xi$ fail to exist.


5 


THREE IMPORTANT 
PROBABILITY DISTRIBUTIONS 


10. Bernoulli Trials. The Binomial and Poisson Distributions 


By Bernoulli trials we mean identical independent experiments in each of 
which an event A, say, may occur with probability 


p= P(A) 
(p # 0) or fail to occur with probability 
q=1—p. 


Occurrence of the event $A$ is called a “success,” and nonoccurrence of $A$
(i.e., occurrence of the complementary event $\bar{A}$) is called a “failure.”

In the case of $n$ consecutive Bernoulli trials, each elementary event $\omega$
can be described by a sequence like

$$\underbrace{1\,0\,1\,1 \ldots 0\,0\,0\,1}_{n \text{ times}}$$

consisting of $n$ digits, each a 0 or a 1, where success at the $i$th trial is denoted
by a 1 in the $i$th place and failure at the $i$th trial by a 0 in the $i$th place.
Because of the independence of the trials, the probability of an elementary
event $\omega$ in which there are precisely $k$ successes and $n - k$ failures is just

$$P(\omega) = p^k q^{n-k}.$$

Clearly, the various elementary events are equiprobable only if $p = q$.


Now consider the random variable $\xi$ equal to the total number of
successes in $n$ Bernoulli trials, i.e., $\xi(\omega) = k$ if precisely $k$ successes occur in the
elementary event $\omega$. The number of distinct elementary events with the same
total number of successes $k$ is just the number of distinct sequences consisting
of $k$ ones and $n - k$ zeros. But the number of such sequences is just the
binomial coefficient

$$C_k^n = \binom{n}{k} = \frac{n!}{k!(n - k)!}, \tag{5.1}$$

equal to the number of combinations of $n$ things taken $k$ at a time (recall
Theorem 1.3, p. 7). These $C_k^n$ elementary events all have the same probability

$$P(\omega) = p^k q^{n-k},$$

and hence the event $\{\xi = k\}$ has probability

$$P\{\xi = k\} = C_k^n p^k q^{n-k}.$$

Thus the probability distribution of the random variable $\xi$ is given by

$$P_\xi(k) = C_k^n p^k q^{n-k}, \qquad k = 0, 1, \ldots, n, \tag{5.2}$$

and is known as the binomial distribution. The binomial distribution is
specified by two parameters, the probability $p$ of a single success and the
number of trials $n$.

It should be noted that the random variable $\xi$ is the sum

$$\xi = \xi_1 + \cdots + \xi_n \tag{5.3}$$

of $n$ independent random variables $\xi_1, \ldots, \xi_n$, where $\xi_k = 1$ if “success”
occurs at the $k$th trial and $\xi_k = 0$ if “failure” occurs at the $k$th trial. We have

$$E\xi_k = p, \qquad D\xi_k = E\xi_k^2 - (E\xi_k)^2 = p - p^2 = p(1 - p) = pq.$$

Therefore

$$E\xi = np, \qquad D\xi = npq. \tag{5.4}$$
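Formulas (5.2) and (5.4) can be checked by summing the binomial distribution exactly. The following is a small sketch; the particular values $n = 10$, $p = 1/3$ and the function name `binomial_pmf` are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """List of P(k) = C(n, k) * p**k * q**(n - k) for k = 0, ..., n, as in (5.2)."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

n, p = 10, Fraction(1, 3)
pmf = binomial_pmf(n, p)

mean = sum(k * pk for k, pk in enumerate(pmf))               # should equal n*p
var = sum(k * k * pk for k, pk in enumerate(pmf)) - mean**2  # should equal n*p*q

print(sum(pmf), mean, var)  # 1 10/3 20/9
```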


Suppose the number of trials is large while the probability of success $p$
is relatively small, so that each success is a rather rare event while the average
number of successes $np$ is appreciable. Then it is a good approximation to
write

$$P_\xi(k) \sim \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots, \tag{5.5}$$

where $a = np$ is the average number of successes and $e = 2.718\ldots$ is the
base of the natural logarithms. In fact, we know from calculus that

$$\lim_{n \to \infty} \left(1 - \frac{a}{n}\right)^n = e^{-a}.$$

But $p = a/n$, and hence (5.2) gives

$$P_\xi(0) = q^n = \left(1 - \frac{a}{n}\right)^n \sim e^{-a}.$$

Moreover, it is easily found from (5.1) and (5.2) that

$$\frac{P_\xi(k)}{P_\xi(k - 1)} = \frac{np - (k - 1)p}{kq} \to \frac{a}{k}$$

as $n \to \infty$. Therefore

$$P_\xi(1) \sim \frac{a}{1} P_\xi(0) \sim a e^{-a},$$

$$P_\xi(2) \sim \frac{a}{2} P_\xi(1) \sim \frac{a^2}{2} e^{-a},$$

$$\cdots$$

$$P_\xi(k) \sim \frac{a}{k} P_\xi(k - 1) \sim \frac{a^k}{k!} e^{-a},$$

which proves the approximate formula (5.5).
A random variable $\xi$ taking only the integral values $0, 1, 2, \ldots$ is said
to have a Poisson distribution if

$$P_\xi(k) = \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots \tag{5.6}$$

The distribution (5.6) is specified by a single positive parameter $a$, equal to the
mean value of $\xi$:

$$a = E\xi = \sum_{k=0}^{\infty} k P_\xi(k).$$

In fact, it follows from the expansion

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!},$$

valid for all $x$, that

$$E\xi = \sum_{k=0}^{\infty} k \frac{a^k}{k!} e^{-a} = a e^{-a} \sum_{k=1}^{\infty} \frac{a^{k-1}}{(k - 1)!} = a.$$

Remark. Thus the approximate formula (5.5) shows that the total
number of successes in $n$ Bernoulli trials has an approximately Poisson
distribution with parameter $a = np$, if $n$ is large and the probability of
success $p$ is small.
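The quality of the approximation (5.5) can be examined numerically by comparing the binomial probabilities (5.2) with the Poisson probabilities (5.6) for the same $a = np$. The sketch below is illustrative; the function names and the two parameter pairs are assumptions chosen to contrast a "large $n$, small $p$" case with one where $p$ is not small.

```python
from math import comb, exp, factorial

def binomial_prob(n, p, k):
    """P(k) = C(n, k) * p**k * q**(n - k), as in (5.2)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_prob(a, k):
    """P(k) = a**k / k! * exp(-a), as in (5.6)."""
    return a**k / factorial(k) * exp(-a)

def max_gap(n, p):
    """Largest difference between the binomial and Poisson probabilities over k = 0, ..., n."""
    a = n * p
    return max(abs(binomial_prob(n, p, k) - poisson_prob(a, k)) for k in range(min(n, 50) + 1))

# Same mean a = 3 in both cases; only the first has large n and small p.
print(max_gap(1000, 0.003))  # small
print(max_gap(10, 0.3))      # noticeably larger
```

The comparison is restricted to the first 51 values of $k$ to keep the factorials modest; for $a = 3$ the probabilities beyond that range are negligible.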


Example 1 (The lottery ticket problem). How many lottery tickets must
be bought to make the probability of winning at least $P$?

Solution. Let $N$ be the total number of lottery tickets and $M$ the total
number of winning tickets. Then $M/N$ is the probability that a bought ticket
is one of the winning tickets. The purchase of each ticket can be regarded as
a separate trial with probability of “success” $p = M/N$ in a series of $n$
independent trials, where $n$ is the number of tickets bought. If $p$ is relatively small,
as is usually the case, and the given probability $P$ is relatively large, then it is
clear that a rather large number of tickets must be bought to make the
probability of buying at least one winning ticket no smaller than $P$. Hence
the number of winning tickets among those purchased is a random variable
with an approximately Poisson distribution, i.e., the probability that there
are precisely $k$ winning tickets among the $n$ purchased tickets is

$$P(k) \approx \frac{a^k}{k!} e^{-a},$$

where

$$a = n\frac{M}{N}.$$

The probability that at least one of the tickets is a winning ticket is just

$$1 - P(0) = 1 - e^{-a}.$$

Hence $n$ must be at least as large as the smallest positive integer satisfying
the inequality

$$e^{-n(M/N)} \le 1 - P.$$
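Solving the final inequality for $n$ gives $n \ge -(N/M)\ln(1 - P)$. A short sketch of the computation follows; the function name `tickets_needed` and the sample numbers are hypothetical, and the result relies on the Poisson approximation used in the solution rather than on the exact Bernoulli scheme.

```python
import math

def tickets_needed(N, M, P):
    """Smallest positive integer n with exp(-n * M / N) <= 1 - P, for 0 < P < 1."""
    if not 0.0 < P < 1.0:
        raise ValueError("P must lie strictly between 0 and 1")
    n = max(1, math.ceil(-(N / M) * math.log(1.0 - P)))
    # Adjust for floating-point rounding near the boundary, in both directions.
    while math.exp(-n * M / N) > 1.0 - P:
        n += 1
    while n > 1 and math.exp(-(n - 1) * M / N) <= 1.0 - P:
        n -= 1
    return n

# Hypothetical lottery: 1000 tickets, 10 of them winning, target probability 1/2.
print(tickets_needed(1000, 10, 0.5))  # 70
```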


Example 2 (The raisin bun problem). Suppose N raisin buns of equal size 
are baked from a batch of dough into which n raisins have been carefully 
mixed. Then clearly the number of raisins will vary from bun to bun, 
although the average number of raisins per bun is just a = n/N. What is the 
probability that any given bun will contain at least one raisin? 


Solution. It is natural to assume that the volume of the raisins is much
less than that of the dough, so that the raisins move around freely and
virtually independently during the mixing, and hence whether or not a
given raisin ends up in a given bun does not depend on what happens to the
other raisins. Clearly, the raisins will be approximately uniformly
distributed throughout the dough after careful mixing, i.e., every raisin has the
same probability

$$p = \frac{1}{N}$$

of ending up in a given bun.¹ Imagine the raisins numbered from 1 to $n$,


1 If v is the volume of the raisins and V that of the dough, then p = v/V. 
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and select a bun at random, Then we can interpret the problem in term as of 
series of n Bernoulli trials, where “success” at the kth trial means that the 
kth raisin ends up in the given bun. Suppose both the number of rolls N and 
the number of raisins x are large, so that in particular p = 1/N is small. Then 
the number of successes in the n trials, equal to the number of raisins in the 
given bun, has an approximately Poisson distribution, i.e., the probability 
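P(k) of exactly k raisins appearing in the bun is given by

$$P(k) \approx \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots,$$

where

$$a = np = \frac{n}{N}.$$

Hence the probability P of at least one raisin appearing in the bun is

$$P = 1 - P(0) = 1 - e^{-a}.$$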
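To see how close the Poisson value is to the exact binomial one, the two expressions for the probability of at least one raisin can be compared directly. This is a small sketch; the function names and the sample values of n and N are chosen here for illustration.

```python
import math

def p_at_least_one_exact(n, N):
    """Exact value 1 - (1 - 1/N)**n from n Bernoulli trials with p = 1/N."""
    return 1.0 - (1.0 - 1.0 / N) ** n

def p_at_least_one_poisson(n, N):
    """Poisson approximation 1 - exp(-a) with a = n/N."""
    return 1.0 - math.exp(-n / N)

n, N = 2000, 1000  # on average a = 2 raisins per bun
print(p_at_least_one_exact(n, N), p_at_least_one_poisson(n, N))
```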


Example 3 (Radioactive decay). It is observed experimentally that
radium gradually decays into radon by emitting alpha particles (helium
nuclei). The interatomic distances are large enough to justify the assumption
that (the nucleus of) each radium atom disintegrates independently of all the
others. Moreover, each of the $n_0$ radium atoms initially present clearly has
the same small probability p(t) of disintegrating during an interval of t
seconds.² Suppose the disintegration of each radium atom is interpreted as
a “success.” Then the random variable $\xi(t)$, equal to the number of alpha
particles emitted in t seconds, equals the number of successes in a series of
$n_0$ Bernoulli trials with probability of success p(t). The values of $n_0$ and p(t)
are such that the distribution of $\xi(t)$ is very accurately a Poisson distribution,
i.e., the probability of exactly k alpha particles being emitted is given by


$$\mathbf{P}\{\xi(t) = k\} = \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots,$$ (5.7)


where

$$a = \mathbf{E}\xi(t) = n_0 p(t)$$

is the average number of alpha particles emitted in t seconds.

Here we have used a model involving Bernoulli trials as a tool for showing
that the random variable $\xi(t)$ has a Poisson distribution. Another physical
situation leading to a Poisson distribution is considered in Example 4, p. 73.


² A gram of radium ($n_0 \approx 10^{22}$) emits about $10^{10}$ alpha particles per second. Hence
$p(1) \approx 10^{10}/10^{22} = 10^{-12}$.
11. The De Moivre-Laplace Theorem. The Normal Distribution


Next we prove the following basic “limit theorem’’: 


THEOREM 5.1 (De Moivre-Laplace theorem). Given n independent
identically distributed random variables $\xi_1, \ldots, \xi_n$, each taking the value
1 with probability p and the value 0 with probability q = 1 − p, let

$$S_n = \sum_{k=1}^{n} \xi_k, \qquad S_n^* = \frac{S_n - \mathbf{E}S_n}{\sqrt{\mathbf{D}S_n}}.$$

Then

$$\lim_{n\to\infty} \mathbf{P}\{x' \le S_n^* \le x''\} = \frac{1}{\sqrt{2\pi}} \int_{x'}^{x''} e^{-x^2/2}\,dx.$$ (5.8)


Proof. $S_n$ is the random variable denoted by $\xi$ in (5.3) and (5.4),
i.e., $S_n$ is the number of successes in n Bernoulli trials, with mean and
variance

$$\mathbf{E}S_n = np, \qquad \mathbf{D}S_n = npq.$$

Hence the “normalized sum” $S_n^*$ is a random variable taking the values

$$x = \frac{k - np}{\sqrt{npq}}, \qquad k = 0, 1, \ldots, n,$$

with probabilities

$$\mathbf{P}\{S_n^* = x\} = P_n(k) = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k}, \qquad k = 0, 1, \ldots, n.$$


These values divide the interval

$$\left[ -\frac{np}{\sqrt{npq}}, \; \frac{nq}{\sqrt{npq}} \right]$$

into n equal subintervals of length

$$\Delta x = \frac{1}{\sqrt{npq}}.$$


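Clearly, as $n \to \infty$,

$$k = np + \sqrt{npq}\,x \to \infty, \qquad n - k = nq - \sqrt{npq}\,x \to \infty,$$

where the convergence is uniform in every finite interval $x' \le x \le x''$.
Using Stirling’s formula (see p. 10), we find that

$$P_n(k) \sim \frac{\sqrt{2\pi n}\; n^n e^{-n}}{\sqrt{2\pi k}\; k^k e^{-k}\, \sqrt{2\pi (n-k)}\; (n-k)^{n-k} e^{-(n-k)}}\; p^k q^{n-k} = \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{k}{np}\right)^{-k} \left(\frac{n-k}{nq}\right)^{-(n-k)}.$$

Moreover,

$$\frac{k}{np} = 1 + \sqrt{\frac{q}{np}}\,x, \qquad \frac{n-k}{nq} = 1 - \sqrt{\frac{p}{nq}}\,x.$$

Therefore, using the expansion

$$\ln(1 + \alpha) \sim \alpha - \frac{\alpha^2}{2}$$

(as $\alpha \to 0$), we have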


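$$\ln\left(\frac{k}{np}\right)^{-k} = -k \ln\left(1 + \sqrt{\frac{q}{np}}\,x\right) \sim -\left(np + \sqrt{npq}\,x\right)\left(\sqrt{\frac{q}{np}}\,x - \frac{q x^2}{2np}\right),$$

$$\ln\left(\frac{n-k}{nq}\right)^{-(n-k)} = -(n-k) \ln\left(1 - \sqrt{\frac{p}{nq}}\,x\right) \sim \left(nq - \sqrt{npq}\,x\right)\left(\sqrt{\frac{p}{nq}}\,x + \frac{p x^2}{2nq}\right).$$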


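Adding these expressions (the terms $\pm\sqrt{npq}\,x$ cancel, and the terms
in $x^2$ combine to $-(p + q)x^2/2 = -x^2/2$), we find that

$$\ln\left(\frac{k}{np}\right)^{-k} + \ln\left(\frac{n-k}{nq}\right)^{-(n-k)} \to -\frac{x^2}{2},$$

and hence

$$\left(\frac{k}{np}\right)^{-k} \left(\frac{n-k}{nq}\right)^{-(n-k)} \to e^{-x^2/2}$$

uniformly in every finite interval $x' \le x \le x''$. Since

$$\sqrt{\frac{n}{k(n-k)}} \sim \sqrt{\frac{n}{np \cdot nq}} = \frac{1}{\sqrt{npq}},$$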


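it follows that

$$\mathbf{P}\{S_n^* = x\} \sim \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, \Delta x, \qquad \Delta x = \frac{1}{\sqrt{npq}},$$

as $n \to \infty$. Therefore

$$\lim_{n\to\infty} \mathbf{P}\{x' \le S_n^* \le x''\} = \lim_{n\to\infty} \sum_{x' \le x \le x''} \mathbf{P}\{S_n^* = x\} = \lim_{n\to\infty} \sum_{x' \le x \le x''} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, \Delta x,$$ (5.9)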
where the sum is over all values of x in the interval $x' \le x \le x''$. But
clearly

$$\lim_{n\to\infty} \sum_{x' \le x \le x''} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, \Delta x = \frac{1}{\sqrt{2\pi}} \int_{x'}^{x''} e^{-x^2/2}\,dx$$ (5.10)

(why?). Comparing (5.9) and (5.10), we finally get the desired limiting
formula (5.8).
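The limiting formula (5.8) can be checked numerically for a finite n. The sketch below sums the exact binomial probabilities of the values of the normalized sum lying in [x', x''] and compares the result with the normal integral, evaluated through the error function. The function names are illustrative, and the logarithmic form of the binomial probability is used only to avoid overflow.

```python
import math

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binomial_pmf(n, p, k):
    """P_n(k) = C(n, k) p^k q^(n-k), computed through logarithms."""
    q = 1.0 - p
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(q))
    return math.exp(log_pmf)

def normalized_sum_probability(n, p, x1, x2):
    """Exact probability that x1 <= (S_n - np)/sqrt(npq) <= x2."""
    mean, sd = n * p, math.sqrt(n * p * (1.0 - p))
    return sum(binomial_pmf(n, p, k)
               for k in range(n + 1)
               if x1 <= (k - mean) / sd <= x2)

exact = normalized_sum_probability(1000, 0.3, -1.0, 1.0)
limit = Phi(1.0) - Phi(-1.0)
print(exact, limit)
```

For finite n the two numbers differ slightly, since the binomial distribution is discrete; the gap shrinks as n grows.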


According to Theorem 5.1, the limiting distribution of the random
variable $S_n^*$ is the distribution with probability density

$$p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$ (5.11)

Such a distribution is called a normal (or Gaussian) distribution. The density
p(x) is the “bell-shaped” curve shown in Figure 8(a). The corresponding
distribution function is

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du$$ (5.12)

[Figure 8: (a) the normal density p(x); (b) the normal distribution function $\Phi(x)$.]
Table 2. Values of the normal distribution function $\Phi(x)$ given by formula (5.12).

x      Φ(x)
0.0    0.5000
0.1    0.5398
0.2    0.5793
0.3    0.6179
0.4    0.6554
0.5    0.6915
0.6    0.7257
0.7    0.7580
0.8    0.7881
0.9    0.8159
1.0    0.8413
1.1    0.8643
1.2    0.8849
1.3    0.9032
1.4    0.9192
1.5    0.9332
1.6    0.9452
1.7    0.9554
1.8    0.9641
1.9    0.9713
2.0    0.9773
2.1    0.9821
2.2    0.9861
2.3    0.9893
2.4    0.9918
2.5    0.9938
2.6    0.9953
2.7    0.9965
2.8    0.9974
2.9    0.9981
3.0    0.9986

The distribution function $\Phi(x)$ is shown in Figure 8(b). Since p(x) is even, it is clear that

$$\Phi(-x) = 1 - \Phi(x).$$

Representative values of $\Phi(x)$ are given in Table 2.
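Values of the normal distribution function between the tabulated points can be computed from the error function, using the identity Φ(x) = ½[1 + erf(x/√2)]. The following sketch uses only the Python standard library; the helper name `Phi` is chosen here for illustration.

```python
import math

def Phi(x):
    """Normal distribution function (5.12), expressed through math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Print a few values in the same layout as the table.
for x in (0.0, 0.5, 1.0, 1.5, 2.5):
    print(f"{x:.1f}    {Phi(x):.4f}")
```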
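Let $\xi$ be a normal (or Gaussian) random variable, i.e., a random variable
with probability density (5.11). Then

$$\mathbf{E}\xi = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-x^2/2}\,dx = 0,$$

since the integrand is odd. Moreover,

$$\mathbf{D}\xi = \mathbf{E}\xi^2 - (\mathbf{E}\xi)^2 = \mathbf{E}\xi^2 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\,dx,$$

and, integrating by parts,

$$\mathbf{D}\xi = \frac{1}{\sqrt{2\pi}} \lim_{N\to\infty} \left[ -x\, e^{-x^2/2} \Big|_{-N}^{N} + \int_{-N}^{N} e^{-x^2/2}\,dx \right] = 1.$$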


Hence $\xi$ has variance 1. More generally, the random variable with probability
density

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-a)^2/2\sigma^2}$$ (5.13)

is also called a normal random variable, and has mean a and variance $\sigma^2$
(show this).


Example (Brownian motion). Suppose a tiny particle is suspended in a 
homogeneous liquid. Then the particle undergoes random collisions with the 
molecules of the liquid, and, as a result, moves about continually in a 
chaotic fashion. This is the phenomenon of Brownian motion. As a model of 
Brownian motion, we make the following simplifying assumptions, charac- 
terizing a ‘‘discrete random walk” in one dimension: 


1) The particle moves only along the x-axis.

2) The particle moves only at the times $t = n\Delta t$, n = 0, 1, 2, ...

3) Suppose the particle is at position x at time t. Then, regardless of its
previous behavior, the particle moves to either of the two neighboring
positions $x + \Delta x$ and $x - \Delta x$ ($\Delta x > 0$) with probability 1/2.
In other words, at each step the particle undergoes a shift of amount
$\Delta x$ either to the right or to the left, with equal probability.³


Now let $\xi(t)$ denote the position of our “Brownian particle” at time t,
and suppose the particle is at the point x = 0 at time t = 0, so that $\xi(0) = 0$.
Then after $t = n\Delta t$ seconds, the particle undergoes n displacements of amount
$\Delta x$, of which $S_n$, say, are to the right (the positive direction) and $n - S_n$ to
the left (the negative direction). As a result, the position of the particle at
time $t = n\Delta t$ is just

$$\xi(t) = [S_n \Delta x - (n - S_n)\Delta x] = (2S_n - n)\Delta x.$$ (5.14)

Moreover, since $\xi(0) = 0$, we have

$$\xi(t) = [\xi(s) - \xi(0)] + [\xi(t) - \xi(s)]$$

for any s in the interval $0 \le s \le t$ (for the time being, s is an integral multiple
of $\Delta t$). With our assumptions, it is clear that the increments $\xi(s) - \xi(0)$
and $\xi(t) - \xi(s)$ are independent random variables, and that the probability
distribution of $\xi(t) - \xi(s)$ is the same as that of $\xi(t - s) - \xi(0)$. Therefore
the variance $\sigma^2(t) = \mathbf{D}\xi(t)$ satisfies the relation

$$\sigma^2(t) = \sigma^2(s) + \sigma^2(t - s), \qquad 0 \le s \le t.$$

It follows that $\sigma^2(t)$ is proportional to t, i.e.,⁴

$$\mathbf{D}\xi(t) = \sigma^2 t,$$ (5.15)


where $\sigma^2$ is a constant called the diffusion coefficient. On the other hand, it is
easy to see that after a time t, i.e., after $n = t/\Delta t$ steps, each contributing
a variance $(\Delta x)^2$, the variance of the displacement must be

$$\mathbf{D}\xi(t) = \frac{(\Delta x)^2}{\Delta t}\, t.$$ (5.16)

Comparing (5.15) and (5.16), we obtain

$$\sigma^2 = \frac{(\Delta x)^2}{\Delta t}.$$ (5.17)

The displacements of the particle are independent of one another and can
be regarded as Bernoulli trials with probability of “success” p = 1/2, “success”
being interpreted as a displacement in the positive direction. In this sense,
the number of displacements $S_n$ in the positive direction is just the number of


³ We will eventually pass to the limit $\Delta t \to 0$, $\Delta x \to 0$, thereby getting the “continuous
random walk” characteristic of the actual physical process of Brownian motion.
⁴ See footnote 4, p. 40.
successes in n Bernoulli trials. Moreover, the relation between the particle’s
position at time t and the normalized random variable

$$S_n^* = \frac{S_n - np}{\sqrt{npq}} = \frac{2S_n - n}{\sqrt{n}}$$

is given by

$$\xi(t) = S_n^* \sqrt{n}\, \Delta x = S_n^* \sqrt{t}\, \frac{\Delta x}{\sqrt{\Delta t}} = S_n^*\, \sigma \sqrt{t},$$

because of (5.14) and (5.17). Applying Theorem 5.1, in particular formula
(5.8), and passing to the limit $\Delta t \to 0$ while holding $\sigma$ constant (so that
$\Delta x \to 0$), we find that the random variable $\xi(t)$ describing the one-dimensional
Brownian motion satisfies the formula

$$\mathbf{P}\left\{ x' \le \frac{\xi(t)}{\sigma\sqrt{t}} \le x'' \right\} = \lim_{\Delta t \to 0} \mathbf{P}\{x' \le S_n^* \le x''\} = \frac{1}{\sqrt{2\pi}} \int_{x'}^{x''} e^{-x^2/2}\,dx.$$

Therefore $\xi(t)$ is a normal random variable with probability distribution

$$\mathbf{P}\{x' \le \xi(t) \le x''\} = \frac{1}{\sqrt{2\pi t}\,\sigma} \int_{x'}^{x''} e^{-x^2/2\sigma^2 t}\,dx.$$
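The relation between the step sizes and the diffusion coefficient can be verified on the discrete walk itself. The sketch below builds the exact distribution of the position (2S_n − n)Δx after n steps and checks that its variance equals σ²t with σ² = (Δx)²/Δt. The function names and the numerical step sizes are assumptions made for illustration.

```python
import math

def walk_distribution(n, dx):
    """Exact distribution of the position after n steps of size dx.

    The position is (2k - n) * dx, where k steps go to the right;
    k is binomial with p = 1/2, so its probability is C(n, k) / 2**n.
    """
    return [((2 * k - n) * dx, math.comb(n, k) / 2 ** n) for k in range(n + 1)]

def variance(dist):
    """Variance of a finite distribution given as (value, probability) pairs."""
    mean = sum(x * p for x, p in dist)
    return sum((x - mean) ** 2 * p for x, p in dist)

dt, dx = 0.01, 0.1           # illustrative step sizes
sigma2 = dx ** 2 / dt        # diffusion coefficient, as in (5.17)
n = 400                      # number of steps, so t = n * dt
t = n * dt
print(variance(walk_distribution(n, dx)), sigma2 * t)
```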


PROBLEMS 


1. Consider the game of “heads or tails,” as in Example 3, p. 29. Show that
the probability of correctly calling the side of the coin landing upward is always
1/2 regardless of the call, provided the coin is unbiased. However, show that if
the coin is biased, then “heads” should be called all the time if heads are more
likely, while “tails” should be called all the time if tails are more likely.


2. There are 10 children in a given family. Assuming that a boy is as likely to 
be born as a girl, what is the probability of the family having 
a) 5 boys and 5 girls; b) From 3 to 7 boys? 


3. Suppose the probability of hitting a target with a single shot is 0.001. What
is the probability P of hitting the target 2 or more times in 5000 shots?

Ans. $P \approx 1 - 6e^{-5} \approx 0.96$.


4. The page proof of a 500-page book contains 500 misprints. What is the 
probability P of 2 or more misprints appearing on the same page? 


Ans. $P \approx 1 - \frac{5}{2e} \approx 0.08$.


5. Let p be the probability of success in a series of Bernoulli trials. What is the
probability $P_n$ of an even number of successes in n trials?

Ans. $P_n = \frac{1}{2}[1 + (1 - 2p)^n]$.
6. What is the probability of the pattern SFS appearing infinitely often in an
infinite series of Bernoulli trials, if S denotes “success” and F “failure”?

Hint. Apply the second Borel-Cantelli lemma (Theorem 3.1, p. 33).⁵

Ans. 1.

7. An electronic computer contains 1000 transistors. Suppose each transistor
has probability 0.001 of failing in the course of a year of operation. What is the
probability of at least 3 transistors failing in a year?


8. A school has 730 students. What is the probability that exactly 4 students were
born on January 1?

Hint. Neglect leap years.

9. Let $\xi$ be a random variable with the Poisson distribution (5.6). Find

a) $\sigma^2 = \mathbf{D}\xi$; b) $\dfrac{\mathbf{E}(\xi - a)^3}{\sigma^3}$.

Ans. a) a; b) $\dfrac{1}{\sqrt{a}}$.
10. Where is the uniform convergence used in the proof of Theorem 5.1? 
11. The probability of occurrence of an event A in one trial is 0.3. What is the 
probability P that the relative frequency of A in 100 independent trials will lie 
between 0.2 and 0.4? 


Hint. Use Theorem 5.1 and Table 2. 
Ans. P ~ 0.97. 


12. Suppose an event A has probability 0.4. How many trials must be performed 
to assert with probability 0.9 that the relative frequency of A differs from 0.4 by 
no more than 0.1? 


Ans. About 65. 


13. The probability of occurrence of an event A in one trial is 0.6. What is the 
probability P that A occurs in the majority of 60 trials? 


Ans. $P \approx 0.94$.


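14. Two continuous random variables $\xi_1$ and $\xi_2$ are said to have a bivariate
normal distribution if their joint probability density is

$$p_{\xi_1,\xi_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \exp\left\{ -\frac{1}{2(1 - r^2)} \left[ \frac{(x_1 - a)^2}{\sigma_1^2} - \frac{2r(x_1 - a)(x_2 - b)}{\sigma_1\sigma_2} + \frac{(x_2 - b)^2}{\sigma_2^2} \right] \right\},$$ (5.18)

where $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < r < 1$. Prove that each of the random variables
$\xi_1$ and $\xi_2$ has a univariate (i.e., one-dimensional) normal distribution of the
form (5.13), where $\mathbf{E}\xi_1 = a$, $\mathbf{D}\xi_1 = \sigma_1^2$, $\mathbf{E}\xi_2 = b$, $\mathbf{D}\xi_2 = \sigma_2^2$.

Hint. Clearly,

$$p_{\xi_1}(x_1) = \int_{-\infty}^{\infty} p_{\xi_1,\xi_2}(x_1, x_2)\,dx_2, \qquad p_{\xi_2}(x_2) = \int_{-\infty}^{\infty} p_{\xi_1,\xi_2}(x_1, x_2)\,dx_1$$

(why?).


⁵ For further details, see W. Feller, op. cit., p. 202.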


15. Prove that the number r in (5.18) is the correlation coefficient of the random
variables $\xi_1$ and $\xi_2$. Prove that $\xi_1$ and $\xi_2$ are independent if and only if r = 0.

Comment. This is a situation in which r is a satisfactory measure of the
extent to which the random variables $\xi_1$ and $\xi_2$ are dependent (the larger |r|,
the “more dependent” $\xi_1$ and $\xi_2$).

16. Let $\xi_1$ and $\xi_2$ be the same as in Problem 14. Find the probability distribution
of $\eta = \xi_1 + \xi_2$.

Ans. The random variable $\eta$ is normal, with probability density

$$p_\eta(x) = \frac{1}{\sqrt{2\pi(\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2)}} \exp\left[ -\frac{(x - a - b)^2}{2(\sigma_1^2 + 2r\sigma_1\sigma_2 + \sigma_2^2)} \right].$$


6 


SOME LIMIT THEOREMS 


12. The Law of Large Numbers 


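Consider n independent identically distributed random variables $\xi_1, \ldots, \xi_n$.
In particular, $\xi_1, \ldots, \xi_n$ have the same mean $a = \mathbf{E}\xi_k$ and variance
$\sigma^2 = \mathbf{D}\xi_k$. If

$$\eta = \frac{1}{n}(\xi_1 + \cdots + \xi_n)$$

is the arithmetic mean of the variables $\xi_1, \ldots, \xi_n$, then

$$\mathbf{E}\eta = \frac{1}{n} \sum_{k=1}^{n} \mathbf{E}\xi_k = a,$$

$$\mathbf{D}\eta = \mathbf{E}(\eta - a)^2 = \frac{1}{n^2} \sum_{k=1}^{n} \mathbf{D}\xi_k = \frac{\sigma^2}{n}.$$

Applying Chebyshev’s inequality (see Sec. 9) to the random variable $\eta - a$,
we get the inequality

$$\mathbf{P}\{|\eta - a| > \varepsilon\} \le \frac{1}{\varepsilon^2}\, \mathbf{E}(\eta - a)^2 = \frac{\sigma^2}{n\varepsilon^2}$$ (6.1)

for arbitrary $\varepsilon > 0$.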


THEOREM 6.1 (Weak law of large numbers). Let $\xi_1, \ldots, \xi_n$ be n independent
identically distributed random variables with mean a and variance
$\sigma^2$. Then, given any $\delta > 0$ and $\varepsilon > 0$, however small, there is an integer
n such that

$$a - \varepsilon \le \frac{1}{n}(\xi_1 + \cdots + \xi_n) \le a + \varepsilon$$

with probability greater than $1 - \delta$.


Proof. The theorem is an immediate consequence of (6.1) if we choose
$n > \sigma^2/\delta\varepsilon^2$. ∎
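The choice of n in the proof translates directly into a sample-size rule. The following sketch returns the smallest integer n exceeding σ²/(δε²); the function name and the example (the mean of the number of spots on a die, with variance 35/12) are illustrative. Chebyshev's inequality is crude, so this n is sufficient but usually far from necessary.

```python
import math

def chebyshev_sample_size(sigma2, eps, delta):
    """Smallest integer n with n > sigma2 / (delta * eps**2).

    For such n, inequality (6.1) gives P{|mean - a| > eps} < delta.
    """
    if sigma2 <= 0 or eps <= 0 or delta <= 0:
        raise ValueError("sigma2, eps and delta must be positive")
    return math.floor(sigma2 / (delta * eps ** 2)) + 1

# Number of die throws guaranteeing |mean - 3.5| <= 0.1 with probability > 0.95.
n = chebyshev_sample_size(35 / 12, eps=0.1, delta=0.05)
print(n)
```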
Remark. Suppose $\delta$ and $\varepsilon$ are so small that we can practically neglect
both the occurrence of events of probability $\delta$ and differences between quantities
differing by no more than $\varepsilon$. Then Theorem 6.1 asserts that for sufficiently
large n, the arithmetic mean

$$\eta = \frac{1}{n}(\xi_1 + \cdots + \xi_n)$$

is an excellent approximation to the mean value $a = \mathbf{E}\xi_k$.


Now consider n consecutive Bernoulli trials, in each of which an event
A can occur with probability $p = \mathbf{P}(A)$ or fail to occur with probability
q = 1 − p. Let $\xi_k$ be a random variable equal to 1 if A occurs at the kth
trial and 0 if A fails to occur at the kth trial. Then the random variables
$\xi_1, \ldots, \xi_n$ are independent and identically distributed (by the very meaning
of Bernoulli trials). Obviously

$$\mathbf{P}\{\xi_k = 1\} = p, \qquad \mathbf{P}\{\xi_k = 0\} = q.$$

Moreover, each random variable $\xi_k$ has mean

$$a = \mathbf{E}\xi_k = p \cdot 1 + q \cdot 0 = p = \mathbf{P}(A).$$

Let n(A) be the number of trials in which A occurs, so that

$$\frac{n(A)}{n}$$

is the relative frequency of the event A. Then clearly

$$n(A) = \xi_1 + \cdots + \xi_n,$$

and hence

$$\frac{n(A)}{n} = \frac{1}{n}(\xi_1 + \cdots + \xi_n).$$

It follows from Theorem 6.1 that n(A)/n virtually coincides with $\mathbf{P}(A)$ for
sufficiently large n, more exactly, that given any $\delta > 0$ and $\varepsilon > 0$, however
small, there is an integer n such that

$$\mathbf{P}(A) - \varepsilon \le \frac{n(A)}{n} \le \mathbf{P}(A) + \varepsilon$$

with probability greater than $1 - \delta$. The justification for formula (1.2),
p. 3 is now apparent.


Remark. It can be shown¹ that with probability 1 the limit

$$\lim_{n\to\infty} \frac{n(A)}{n}$$

exists and equals $\mathbf{P}(A)$. This result is known as the strong law of large
numbers.


13. Generating Functions. Weak Convergence of Probability 
Distributions 


Let $\xi$ be a discrete random variable taking the values 0, 1, 2, ... with
probabilities

$$P_\xi(k) = \mathbf{P}\{\xi = k\}, \qquad k = 0, 1, 2, \ldots$$ (6.2)

Then the function

$$F_\xi(z) = \sum_{k=0}^{\infty} P_\xi(k) z^k, \qquad |z| \le 1$$ (6.3)

is called the generating function of the random variable $\xi$ or of the corresponding
probability distribution (6.2). It follows from the convergence of
the series (6.3) for |z| = 1 and from Weierstrass’s theorem on uniformly
convergent series of analytic functions² that $F_\xi(z)$ is an analytic function of z
in |z| < 1, with (6.3) as its power series expansion. Moreover, the probability
distribution of the random variable $\xi$ is uniquely determined by its generating
function $F_\xi(z)$, and in fact

$$P_\xi(k) = \frac{1}{k!} F_\xi^{(k)}(0), \qquad k = 0, 1, 2, \ldots,$$

where $F_\xi^{(k)}(z)$ is the kth derivative of $F_\xi(z)$. According to formula (4.10),
p. 44, for fixed z the function $F_\xi(z)$ is just the mathematical expectation of
the random variable $\varphi(\xi) = z^\xi$, i.e.,

$$F_\xi(z) = \mathbf{E} z^\xi, \qquad |z| \le 1.$$ (6.4)


Example 1 (The Poisson distribution). If the random variable $\xi$ has a
Poisson distribution with parameter a, so that

$$P_\xi(k) = \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots,$$

then $\xi$ has the generating function

$$F_\xi(z) = \sum_{k=0}^{\infty} \frac{a^k}{k!} e^{-a} z^k = e^{-a} \sum_{k=0}^{\infty} \frac{(az)^k}{k!} = e^{a(z-1)}.$$ (6.5)


¹ See e.g., W. Feller, op. cit., p. 203.
² See e.g., R. A. Silverman, Introductory Complex Analysis, Prentice-Hall, Inc.,
Englewood Cliffs, N.J. (1967), p. 191. Also use Weierstrass’ M-test (ibid., p. 186).


Suppose the random variable $\xi$ has mean $a = \mathbf{E}\xi$ and variance $\sigma^2 = \mathbf{D}\xi$.
Then, differentiating (6.4) twice with respect to z behind the expectation sign
and setting z = 1, we get

$$a = \mathbf{E}\xi = F_\xi'(1), \qquad \sigma^2 = \mathbf{E}\xi^2 - (\mathbf{E}\xi)^2 = F_\xi''(1) + F_\xi'(1) - [F_\xi'(1)]^2.$$ (6.6)

The same formulas can easily be deduced from the power series (6.3). In
fact, differentiating (6.3) for |z| < 1, we get

$$F_\xi'(z) = \sum_{k=0}^{\infty} k P_\xi(k) z^{k-1},$$

and hence

$$a = \sum_{k=0}^{\infty} k P_\xi(k) = \lim_{z \to 1} F_\xi'(z) = F_\xi'(1),$$

and similarly for the second of the formulas (6.6).
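For a distribution with finitely many values, formulas (6.6) reduce to finite sums and are easy to check by machine. The sketch below represents a distribution as a list `pmf` with `pmf[k]` equal to the probability of the value k (a representation assumed here for illustration) and evaluates the first two derivatives of the generating function at z = 1.

```python
def gf_derivatives_at_one(pmf):
    """Return F'(1) and F''(1) for F(z) = sum_k pmf[k] * z**k."""
    d1 = sum(k * p for k, p in enumerate(pmf))            # sum of k P(k)
    d2 = sum(k * (k - 1) * p for k, p in enumerate(pmf))  # sum of k(k-1) P(k)
    return d1, d2

def mean_and_variance(pmf):
    """Mean and variance obtained from formulas (6.6)."""
    d1, d2 = gf_derivatives_at_one(pmf)
    return d1, d2 + d1 - d1 ** 2

# Number of spots on an unbiased die: values 1..6, each with probability 1/6.
die = [0.0] + [1.0 / 6.0] * 6
print(mean_and_variance(die))
```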

Next let $\xi_1, \ldots, \xi_n$ be n independent random variables taking the values
0, 1, 2, ... Then the random variables $z^{\xi_1}, \ldots, z^{\xi_n}$, where z is a fixed number,
are also independent. It follows from formula (4.20), p. 47 that

$$\mathbf{E} z^{\xi_1 + \cdots + \xi_n} = \mathbf{E}\, z^{\xi_1} \cdots z^{\xi_n} = \mathbf{E} z^{\xi_1} \cdots \mathbf{E} z^{\xi_n}.$$

Thus we have the formula

$$F_\xi(z) = F_{\xi_1}(z) \cdots F_{\xi_n}(z),$$ (6.7)

expressing the generating function $F_\xi(z) = \mathbf{E} z^\xi$ of the sum $\xi = \xi_1 + \cdots + \xi_n$
of the n random variables $\xi_1, \ldots, \xi_n$ in terms of the generating functions
$F_{\xi_k}(z) = \mathbf{E} z^{\xi_k}$, k = 1, ..., n of the separate summands.


Example 2 (The binomial distribution). Suppose the random variable $\xi$
has a binomial distribution with parameters p and n, so that

$$P_\xi(k) = C_n^k p^k q^{n-k}, \qquad q = 1 - p, \qquad k = 0, 1, \ldots, n.$$

Then, as already noted in Sec. 10, $\xi$ can be regarded as the sum
$\xi = \xi_1 + \cdots + \xi_n$ of n independent random variables $\xi_1, \ldots, \xi_n$, where

$$\xi_k = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } q. \end{cases}$$

The generating function $F_{\xi_k}(z)$ of each summand is clearly pz + q, and hence,
by (6.7), the generating function of $\xi$ itself is

$$F_\xi(z) = (pz + q)^n.$$ (6.8)
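Formula (6.7) says that generating functions of independent summands multiply, which for polynomials means convolving coefficient lists. As a sketch (with hypothetical helper names), multiplying the polynomial q + pz by itself n times should reproduce the binomial probabilities as coefficients:

```python
import math

def multiply(f, g):
    """Coefficients of the product of two polynomials given by coefficient lists."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def binomial_gf_coefficients(n, p):
    """Coefficients of (pz + q)**n, built by n successive multiplications."""
    coeffs = [1.0]
    for _ in range(n):
        coeffs = multiply(coeffs, [1.0 - p, p])  # q + p z
    return coeffs

n, p = 10, 0.3
coeffs = binomial_gf_coefficients(n, p)
direct = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
print(max(abs(c - d) for c, d in zip(coeffs, direct)))
```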
Now let $\xi_n$, n = 1, 2, ... be a sequence of discrete random variables
taking the values 0, 1, 2, ..., with probability distributions $P_n(k) = P_{\xi_n}(k)$
and generating functions $F_n(z)$, n = 1, 2, ... Then the sequence of distributions
$\{P_n(k)\}$ is said to converge weakly to the limiting distribution P(k) if

$$\lim_{n\to\infty} P_n(k) = P(k)$$ (6.9)

for all k = 0, 1, 2, ...


Example 3 (Weak convergence of the binomial distribution to the Poisson
distribution). Let $\xi_1, \xi_2, \ldots$ be a sequence of random variables such that
$\xi_n$ has a binomial distribution $P_n(k)$ with parameters p and n, i.e.,

$$P_n(k) = C_n^k p^k q^{n-k}, \qquad q = 1 - p.$$

Suppose p depends on n in such a way that the limit

$$\lim_{n\to\infty} np = a$$ (6.10)

exists. Then, according to formula (5.5), p. 55, the sequence of distributions
$\{P_n(k)\}$ converges weakly to the Poisson distribution

$$P(k) = \frac{a^k}{k!} e^{-a}, \qquad k = 0, 1, 2, \ldots,$$

with parameter a given by (6.10).
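The weak convergence in this example can be observed numerically by taking p = a/n and watching the largest pointwise gap between the binomial and Poisson probabilities shrink as n grows. This is an illustrative sketch; the helper names and the cutoff `kmax` are assumptions, not part of the text.

```python
import math

def binomial_pmf(n, p, k):
    """P_n(k) = C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(a, k):
    """P(k) = a^k e^(-a) / k!."""
    return a ** k / math.factorial(k) * math.exp(-a)

def max_gap(n, a, kmax=20):
    """Largest |P_n(k) - P(k)| over k = 0..min(n, kmax), with p = a/n."""
    p = a / n
    return max(abs(binomial_pmf(n, p, k) - poisson_pmf(a, k))
               for k in range(min(n, kmax) + 1))

for n in (10, 100, 1000):
    print(n, max_gap(n, a=2.0))
```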
In Example 3, the sequence of generating functions

$$F_n(z) = (pz + q)^n, \qquad n = 1, 2, \ldots$$

of the random variables $\xi_1, \xi_2, \ldots$ converges uniformly to the generating
function $F(z) = e^{a(z-1)}$ of the limiting Poisson distribution, i.e.,

$$\lim_{n\to\infty} F_n(z) = \lim_{n\to\infty} [1 + p(z - 1)]^n = \lim_{n\to\infty} \left[ 1 + \frac{np(z - 1)}{n} \right]^n = \lim_{n\to\infty} \left[ 1 + \frac{a(z - 1)}{n} \right]^n = e^{a(z-1)}$$

(justify the next-to-the-last step). This is no accident, as shown by


THEOREM 6.2. The sequence of probability distributions $P_n(k)$, n = 1,
2, ... with generating functions $F_n(z)$, n = 1, 2, ... converges weakly to
the limiting distribution P(k) if and only if

$$\lim_{n\to\infty} F_n(z) = F(z),$$ (6.11)

where

$$F(z) = \sum_{k=0}^{\infty} P(k) z^k$$

is the generating function of P(k) and the convergence is uniform in every
disk $|z| \le r < 1$.


Proof. First suppose (6.9) holds. Clearly,

$$|F_n(z) - F(z)| \le \sum_{k=0}^{K} |P_n(k) - P(k)| + \sum_{k=K+1}^{\infty} |z|^k$$ (6.12)

for any positive integer K. Given any $\varepsilon > 0$, we first choose K large
enough to make

$$\sum_{k=K+1}^{\infty} |z|^k \le \frac{r^{K+1}}{1 - r} < \frac{\varepsilon}{2}$$

and then find a positive integer N such that

$$|P_n(k) - P(k)| < \frac{\varepsilon}{2(K + 1)}$$

holds for k = 0, ..., K if n > N. It then follows from (6.12) that

$$|F_n(z) - F(z)| < \varepsilon$$

if n > N, which immediately proves (6.11).

Conversely, suppose (6.11) holds, where the convergence is uniform
in every disk $|z| \le r < 1$. Then, by Weierstrass’ theorem on uniformly
convergent sequences of analytic functions,³

$$\lim_{n\to\infty} F_n^{(k)}(z) = F^{(k)}(z), \qquad |z| < 1$$ (6.13)

for all k = 0, 1, 2, ... But

$$P_n(k) = \frac{1}{k!} F_n^{(k)}(0), \qquad P(k) = \frac{1}{k!} F^{(k)}(0),$$

and hence (6.13) implies (6.9) for all k = 0, 1, 2, ... ∎
The following example is typical of the situations where the Poisson
distribution is encountered:


Example 4 (Random flow of events). Suppose that events of a given kind
occur randomly in the course of time. For example, we can think in terms
of “service calls” (requests for service) arriving randomly at some “server”
(service facility), like inquiries at an information desk, arrival of motorists
at a gas station, telephone calls at an exchange, etc. Let $\xi(\Delta)$ be the number of
events occurring during the time interval $\Delta$. Then what is the distribution of
the random variable $\xi(\Delta)$?

To answer this question, we will assume that our “random flow of events”
has the following three properties:


a) The events are independent of one another; more exactly, the random
variables $\xi(\Delta_1), \xi(\Delta_2), \ldots$ are independent if the intervals $\Delta_1, \Delta_2, \ldots$
are nonoverlapping.

b) The flow of events is “stationary,” i.e., the distribution of the random
variable $\xi(\Delta)$ depends only on the length of the interval $\Delta$ and not on
the time of its occurrence (the initial time of $\Delta$, say).

c) The probability that at least one event occurs in a small time interval
$\Delta t$ is $\lambda \Delta t + o(\Delta t)$, while the probability that more than one event
occurs in $\Delta t$ is $o(\Delta t)$. Here $o(\Delta t)$ is an infinitesimal of higher order than
$\Delta t$, i.e.,

$$\lim_{\Delta t \to 0} \frac{o(\Delta t)}{\Delta t} = 0,$$

and $\lambda$ is a positive parameter characterizing the “rate of occurrence”
or “density” of the events.


³ R. A. Silverman, op. cit., p. 192.


Now consider the time interval $\Delta = [0, t]$, and let $\xi(t)$ be the total
number of events occurring in [0, t]. Dividing [0, t] into n equal parts
$\Delta_1, \ldots, \Delta_n$, we find that

$$\xi(t) = \sum_{k=1}^{n} \xi(\Delta_k),$$

where $\xi(\Delta_1), \ldots, \xi(\Delta_n)$ are independent random variables and $\xi(\Delta_k)$ is the
number of events occurring in the interval $\Delta_k$. Clearly, the generating
function of each random variable $\xi(\Delta_k)$ is

$$F_n(z) = \left( 1 - \frac{\lambda t}{n} \right) + \frac{\lambda t}{n} z + o\!\left( \frac{t}{n} \right),$$

where o(t/n) is a term of order higher than t/n. Hence, by (6.7), the generating
function of $\xi(t)$ is

$$F(z) = [F_n(z)]^n = \left[ 1 + \frac{\lambda t (z - 1)}{n} + o\!\left( \frac{t}{n} \right) \right]^n.$$

But F(z) is independent of the subintervals $\Delta_1, \ldots, \Delta_n$, and hence we can
take the limit as $n \to \infty$, obtaining

$$F(z) = \lim_{n\to\infty} \left[ 1 + \frac{\lambda t (z - 1)}{n} \right]^n = e^{\lambda t (z - 1)}.$$

Comparing this with (6.5), we find that F(z) is the generating function of a
Poisson distribution with parameter $a = \lambda t$, so that

$$\mathbf{P}\{\xi(t) = k\} = \frac{(\lambda t)^k}{k!} e^{-\lambda t}, \qquad k = 0, 1, 2, \ldots$$

Since

$$\lambda t = \mathbf{E}\xi(t),$$

the parameter $\lambda$ is just the average number of events occurring per unit time.


14. Characteristic Functions. The Central Limit Theorem


Given a real random variable $\xi$, by the characteristic function of $\xi$ is
meant the function

$$f_\xi(t) = \mathbf{E} e^{i\xi t}, \qquad -\infty < t < \infty.$$ (6.14)

Clearly, $f_\xi(t)$ coincides for every fixed t with the mathematical expectation of
the complex random variable $\eta = e^{i\xi t}$. For a discrete random variable taking
the values 0, 1, 2, ..., the characteristic function $f_\xi(t)$ coincides with the
values of the generating function $F_\xi(z)$ on the boundary of the unit circle
|z| = 1, i.e.,

$$f_\xi(t) = F_\xi(e^{it}) = \sum_{k=0}^{\infty} P_\xi(k) e^{ikt}.$$

This formula represents $f_\xi(t)$ as a Fourier series, with the probabilities
$P_\xi(k) = \mathbf{P}\{\xi = k\}$, k = 0, 1, 2, ... as its coefficients. Thus these probabilities
$P_\xi(k)$ are uniquely determined by the characteristic function $f_\xi(t)$.

If $\xi$ is a continuous random variable with probability density $p_\xi(x)$, then,
by formula (4.18), p. 47, the characteristic function is the Fourier transform
of the density $p_\xi(x)$:

$$f_\xi(t) = \int_{-\infty}^{\infty} e^{ixt} p_\xi(x)\,dx.$$ (6.15)

Inverting (6.15), we find that

$$p_\xi(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ixt} f_\xi(t)\,dt,$$ (6.16)

at least at points where $p_\xi(x)$ is suitably well-behaved.⁴ Thus $p_\xi(x)$ is uniquely
determined by the characteristic function $f_\xi(t)$.


⁴ If (6.16) fails, another inversion formula can be used, giving the distribution function
$\Phi_\xi(x) = \mathbf{P}\{\xi \le x\}$ in terms of the characteristic function $f_\xi(t)$ (see e.g., B. V. Gnedenko,
op. cit., Sec. 36). We can then deduce $p_\xi(x)$ from $\Phi_\xi(x)$ by differentiation, at least almost
everywhere (recall footnote 2, p. 38).
Example 1. Let $\xi$ be a normally distributed random variable, with
probability density

$$p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$ (6.17)

Then, by (6.15), the characteristic function of $\xi$ is given by

$$f(t) = \int_{-\infty}^{\infty} e^{ixt} p(x)\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{ixt - (x^2/2)}\,dx = e^{-t^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x - it)^2/2}\,dx.$$ (6.18)

The function $\varphi(z) = e^{-z^2/2}$ is an analytic function of the complex variable z,
and hence, by Cauchy’s integral theorem,⁵ the integral of $\varphi(z)$ along the
rectangular contour with vertices (−N, 0), (N, 0), (N, −it), (−N, −it)
equals zero. Therefore

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x - it)^2/2}\,dx = \lim_{N\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-N - it}^{N - it} e^{-z^2/2}\,dz = \lim_{N\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-N}^{N} e^{-x^2/2}\,dx,$$ (6.19)

where we use the fact that the integral of $\varphi(z)$ along the vertical sides of the
contour vanishes as $N \to \infty$ (why?). But

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \int_{-\infty}^{\infty} p(x)\,dx = 1,$$

as for any probability density. Hence (6.18) and (6.19) imply

$$f(t) = e^{-t^2/2}.$$ (6.20)
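Formula (6.20) can be checked without contour integration by evaluating the integral (6.15) numerically. The sketch below applies the trapezoidal rule on a truncated interval; the truncation half-width and the number of steps are assumptions chosen so that the neglected tails are negligible.

```python
import cmath
import math

def char_fn_standard_normal(t, half_width=12.0, steps=4000):
    """Trapezoidal approximation of the integral of exp(i*x*t) * p(x) over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0j
    for j in range(steps + 1):
        x = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0  # end points count half
        total += weight * cmath.exp(1j * x * t) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, char_fn_standard_normal(t), math.exp(-t * t / 2.0))
```

The imaginary part comes out essentially zero, as it must for a density that is an even function.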


Now suppose the random variable $\xi$ is such that $\mathbf{E}|\xi|^3$ exists. Then the
characteristic function $f_\xi(t)$ has the expansion

$$f_\xi(t) = 1 + i\,\mathbf{E}\xi\, t - \frac{\mathbf{E}\xi^2}{2}\, t^2 + R(t),$$ (6.21)

where the remainder R(t) satisfies the estimate

$$|R(t)| \le C\, \mathbf{E}|\xi|^3\, |t|^3$$

(C denotes a constant). In fact, we need only note that

$$e^{i\xi t} = 1 + i\xi t - \frac{\xi^2 t^2}{2} + \theta,$$ (6.22)

by Taylor’s formula, where

$$|\theta| \le C\, |\xi|^3 |t|^3.$$

We then get (6.21) by taking the mathematical expectation of both sides of
(6.22). In particular, it follows from (6.21) that the mean $a = \mathbf{E}\xi$ and
variance $\sigma^2 = \mathbf{D}\xi$ are given by the formulas

$$a = -i f_\xi'(0), \qquad \sigma^2 = -f_\xi''(0) + [f_\xi'(0)]^2.$$ (6.23)


⁵ R. A. Silverman, op. cit., p. 146.


Example 2. According to (6.23), the normally distributed random variable
$\xi$ with probability density (6.17) has mean

$$a = -i f'(0) = 0$$

and variance

$$\sigma^2 = -f''(0) = 1.$$


Formula (6.7) has a natural analogue for characteristic functions. In
fact, if $\xi_1, \ldots, \xi_n$ are independent random variables with sum
$\xi = \xi_1 + \cdots + \xi_n$, then, by formula (4.20), p. 47, the characteristic function of $\xi$ is

$$f_\xi(t) = f_{\xi_1}(t) \cdots f_{\xi_n}(t).$$ (6.24)

Let $\xi_n$, n = 1, 2, ... be a sequence of random variables with characteristic
functions $f_n(t)$, n = 1, 2, ... Then the sequence of probability distributions
of $\xi_1, \xi_2, \ldots$ is said to converge weakly to the distribution with
density p(x) if

$$\lim_{n\to\infty} \mathbf{P}\{x' \le \xi_n \le x''\} = \int_{x'}^{x''} p(x)\,dx$$

for all x' and x'' (x' < x''). This should be compared with the definition of
weak convergence for discrete random variables taking the values 0, 1, 2, ...,
given in Sec. 13.

Theorem 6.2 has a natural analogue for characteristic functions, whose
proof will not be given here:⁶


THEOREM 6.2′. The sequence of probability distributions with characteristic
functions $f_n(t)$, n = 1, 2, ... converges weakly to the limiting
distribution with density p(x) if and only if

$$\lim_{n\to\infty} f_n(t) = f(t),$$ (6.11′)

where

$$f(t) = \int_{-\infty}^{\infty} e^{ixt} p(x)\,dx$$

is the characteristic function of the limiting distribution and the convergence
is uniform in every finite interval $t' \le t \le t''$.


⁶ For the proof, see e.g., B. V. Gnedenko, op. cit., Sec. 38.


We now prove a key proposition of probability theory, called the central
limit theorem, which has the De Moivre-Laplace theorem (Theorem 5.1,
p. 59) as a very special case. Roughly speaking, the central limit theorem
asserts that the distribution of the sum of a large number of independent
identically distributed random variables is approximately normal.


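DEFINITION. Given a sequence of random variables $\xi_k$, k = 1, 2, ...
with finite means $a_k = \mathbf{E}\xi_k$ and variances $\sigma_k^2 = \mathbf{D}\xi_k$, consider the
“normalized sum”

$$S_n^* = \frac{S_n - \mathbf{E}S_n}{\sqrt{\mathbf{D}S_n}},$$

where

$$S_n = \sum_{k=1}^{n} \xi_k.$$

Then the sequence $\xi_k$, k = 1, 2, ... is said to satisfy the central limit
theorem if⁷

$$\lim_{n\to\infty} \mathbf{P}\{x' \le S_n^* \le x''\} = \frac{1}{\sqrt{2\pi}} \int_{x'}^{x''} e^{-x^2/2}\,dx.$$ (6.25)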


THEOREM 6.3. Suppose the sequence of independent random variables
$\xi_k$, k = 1, 2, ... with means $a_k$ and variances $\sigma_k^2$ satisfies the Lyapunov
condition

$$\lim_{n\to\infty} \frac{1}{B_n^3} \sum_{k=1}^{n} \mathbf{E}|\xi_k - a_k|^3 = 0,$$ (6.26)

where

$$B_n^2 = \mathbf{D}S_n = \sum_{k=1}^{n} \sigma_k^2.$$

Then the sequence of random variables satisfies the central limit theorem.


Proof. Equation (6.25) means that the sequence of distributions
of the normalized sums $S_n^*$, n = 1, 2, ... converges weakly to the
normal distribution with probability density (6.17). Hence, according
to Theorem 6.2′, we need only show that the sequence of characteristic
functions $f_n(t)$, n = 1, 2, ... of the random variables $S_n^*$ converges
uniformly in every finite interval $t' \le t \le t''$ to the characteristic
function $f(t) = e^{-t^2/2}$ of this normal distribution (recall Example 1).
Clearly,

$$S_n^* = \sum_{k=1}^{n} \frac{\xi_k - a_k}{B_n}, \qquad B_n^2 = \mathbf{D}S_n.$$

The random variable $\xi_k - a_k$ has zero mean and variance $\sigma_k^2$, and
hence, by (6.21), has characteristic function

$$g_k(t) = 1 - \frac{\sigma_k^2}{2}\, t^2 + R_k(t),$$

where

$$|R_k(t)| \le C\, |t|^3\, \mathbf{E}|\xi_k - a_k|^3$$

(C is some constant). Therefore the characteristic function of the
random variable $\eta_{kn} = (\xi_k - a_k)/B_n$ is

$$f_{kn}(t) = g_k\!\left( \frac{t}{B_n} \right) = 1 - \frac{\sigma_k^2}{2B_n^2}\, t^2 + R_k\!\left( \frac{t}{B_n} \right),$$

where

$$\left| R_k\!\left( \frac{t}{B_n} \right) \right| \le C\, |t|^3\, \frac{\mathbf{E}|\xi_k - a_k|^3}{B_n^3}.$$

It follows from (6.24) that the random variable $S_n^* = \eta_{1n} + \cdots + \eta_{nn}$
has characteristic function

$$f_n(t) = \prod_{k=1}^{n} f_{kn}(t).$$

Hence

$$\ln f_n(t) = \sum_{k=1}^{n} \ln f_{kn}(t) \sim \sum_{k=1}^{n} \left[ -\frac{\sigma_k^2}{2B_n^2}\, t^2 + R_k\!\left( \frac{t}{B_n} \right) \right],$$

where, because of the hypothesis (6.26),

$$\left| \sum_{k=1}^{n} R_k\!\left( \frac{t}{B_n} \right) \right| \le C\, |t|^3\, \frac{1}{B_n^3} \sum_{k=1}^{n} \mathbf{E}|\xi_k - a_k|^3 \to 0$$

as $n \to \infty$ uniformly in every finite interval $t' \le t \le t''$. Therefore,
since $\sum_{k=1}^{n} \sigma_k^2 = B_n^2$,

$$\ln f_n(t) \to -\frac{t^2}{2},$$

or equivalently

$$f_n(t) \to e^{-t^2/2}$$

as $n \to \infty$. ∎


⁷ Cf. formula (5.8), p. 59. Note that the right-hand side of (6.25) equals $\Phi(x'') - \Phi(x')$,
where $\Phi(x)$ is the distribution function of a normal random variable with mean 0 and
variance 1.
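The statement of the central limit theorem can be illustrated on a sum of identically distributed variables whose exact distribution is computable. The sketch below takes the total number of spots in n throws of an unbiased die, obtains its distribution by repeated convolution, and compares the probability that the normalized sum lies in [x', x''] with the normal value. The helper names and the choice n = 50 are illustrative.

```python
import math

def convolve(f, g):
    """Distribution of the sum of two independent variables with pmfs f and g."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        if a == 0.0:
            continue
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def sum_pmf(pmf, n):
    """Distribution of the sum of n independent copies of a variable with the given pmf."""
    total = [1.0]
    for _ in range(n):
        total = convolve(total, pmf)
    return total

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalized_sum_probability(pmf, n, x1, x2):
    """Exact probability that x1 <= (S_n - E S_n)/sqrt(D S_n) <= x2."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    m, s = n * mean, math.sqrt(n * var)
    return sum(p for k, p in enumerate(sum_pmf(pmf, n)) if x1 <= (k - m) / s <= x2)

DIE = [0.0] + [1.0 / 6.0] * 6
exact = normalized_sum_probability(DIE, 50, -1.0, 1.0)
print(exact, Phi(1.0) - Phi(-1.0))
```

The remaining difference between the two numbers is mostly due to the discreteness of the sum and decreases as n increases.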
Example 3. The Lyapunov condition is always satisfied if the random
variables $\xi_1, \xi_2, \ldots$ are identically distributed and if $\alpha = \mathbf{E}|\xi_k - a_k|^3$ exists.
In fact,

$$B_n^2 = n\sigma^2,$$

where $\sigma^2 = \mathbf{D}\xi_k$, and hence

$$\lim_{n\to\infty} \frac{1}{B_n^3} \sum_{k=1}^{n} \mathbf{E}|\xi_k - a_k|^3 = \lim_{n\to\infty} \frac{1}{\sqrt{n}}\, \frac{\alpha}{\sigma^3} = 0.$$

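For identically distributed summands the quantity under the limit sign in the Lyapunov condition has the closed form α/(√n σ³), which makes its decay easy to tabulate. The following sketch computes the ratio for a variable with finitely many values; the function name and its arguments are hypothetical.

```python
import math

def lyapunov_ratio(values, probs, n):
    """(1/B_n^3) * sum of E|xi_k - a|^3 for n identically distributed variables."""
    a = sum(v * p for v, p in zip(values, probs))
    sigma2 = sum((v - a) ** 2 * p for v, p in zip(values, probs))
    alpha = sum(abs(v - a) ** 3 * p for v, p in zip(values, probs))
    B_n = math.sqrt(n * sigma2)  # B_n^2 = n * sigma^2
    return n * alpha / B_n ** 3

# Unbiased die: the ratio falls off like 1/sqrt(n).
faces, probs = [1, 2, 3, 4, 5, 6], [1.0 / 6.0] * 6
for n in (1, 100, 10_000):
    print(n, lyapunov_ratio(faces, probs, n))
```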
PROBLEMS 


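1. Show that the conclusion of Theorem 6.1 can be written in the form

$$\lim_{n\to\infty} \mathbf{P}\left\{ \left| \frac{1}{n} \sum_{k=1}^{n} \xi_k - a \right| < \varepsilon \right\} = 1$$

for arbitrary $\varepsilon > 0$.


2. Let $\xi_1, \ldots, \xi_n$ be n independent identically distributed random variables,
with common mean $a = \mathbf{E}\xi_k$ and variance $\sigma^2 = \mathbf{D}\xi_k$. Suppose a is known. Can
the quantity

$$\frac{1}{n} \sum_{k=1}^{n} (\xi_k - a)^2$$

be used to estimate $\sigma^2$?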


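3. A random variable $\xi$ has probability density⁸

$$p_\xi(x) = \begin{cases} \dfrac{x^m}{m!}\, e^{-x} & \text{if } x > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where m is a positive integer. Prove that

$$\mathbf{P}\{0 < \xi < 2(m + 1)\} > \frac{m}{m + 1}.$$

Hint. Use Chebyshev’s inequality.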


4. The probability of an event A occurring in one trial is 1/2. Is it true that the
probability of A occurring between 400 and 600 times in 1000 independent
trials exceeds 0.97?

Ans. Yes.

5. Let $\xi$ be the number of spots obtained in throwing an unbiased die. What is
the generating function of $\xi$?


⁸ It follows by repeated integration by parts that $\int_0^{\infty} x^m e^{-x}\,dx = m!$

6. Use (6.6) and the result of the preceding problem to solve Problem 16, p. 52.


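7. Let $\xi$ be a random variable with the Poisson distribution

$$P_\xi(k) = \frac{a^k}{k!}\, e^{-a}, \qquad k = 0, 1, 2, \ldots \tag{6.27}$$

Use (6.6) to show that $\mathbf{E}\xi = \mathbf{D}\xi = a$.

8. Find the generating function of the random variable $\xi$ with distribution

$$P_\xi(k) = \frac{a^k}{(1+a)^{k+1}} \qquad (a > 0).$$

Use (6.6) to find $\mathbf{E}\xi$ and $\mathbf{D}\xi$.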


9. Let $\eta$ be the sum of two independent random variables $\xi_1$ and $\xi_2$, one with the Poisson distribution (6.27), the other with the Poisson distribution obtained by changing the parameter $a$ to $a'$ in (6.27). Show that $\eta$ also has a Poisson distribution, with parameter $a + a'$.

10. Let $S_n$ be the number of successes in a series of $n$ independent trials, where the probability of success at the $k$th trial is $p_k$. Suppose $p_1, \ldots, p_n$ depend on $n$ in such a way that

$$p_1 + \cdots + p_n = \lambda,$$

while

$$\max\{p_1, \ldots, p_n\} \to 0$$

as $n \to \infty$. Prove that $S_n$ has a Poisson distribution with parameter $\lambda$ in the limit as $n \to \infty$.

Hint. Use Theorem 6.2.†


11. Find the characteristic function $f_\xi(t)$ of the random variable $\xi$ with probability density

$$p_\xi(x) = \frac{1}{2}\, e^{-|x|} \qquad (-\infty < x < \infty).$$

Ans. $f_\xi(t) = \dfrac{1}{1 + t^2}$.


12. Use (6.23) and the result of the preceding problem to solve Problem 13, 
p. 52. 


13. Find the characteristic function of a random variable uniformly distributed in the interval $[a, b]$.

14. A continuous random variable $\xi$ has characteristic function

$$f_\xi(t) = e^{-a|t|} \qquad (a > 0).$$

Find the probability density of $\xi$.

Ans. $p_\xi(x) = \dfrac{a}{\pi(a^2 + x^2)}$.

† For the details, see W. Feller, op. cit., p. 282.


15. The derivatives $f_\xi'(0)$ and $f_\xi''(0)$ do not exist in the preceding problem. Why does this make sense?

Hint. Cf. Problem 24, p. 53.


16. Let v be the total number of spots which are obtained in 1000 independent 
throws of an unbiased die. Then Ev = 3500, because of Problem 16, p. 52. 
Estimate the probability that v is a number between 3450 and 3550. 


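17. Let $S_n$ be the same as in Problem 10, and suppose $\sum_{k=1}^{\infty} p_k q_k = \infty$. Prove that

$$\mathbf{P}\left\{x' \le \frac{S_n - \sum_{k=1}^{n} p_k}{\sqrt{\sum_{k=1}^{n} p_k q_k}} \le x''\right\} \to \frac{1}{\sqrt{2\pi}} \int_{x'}^{x''} e^{-x^2/2}\,dx$$

as $n \to \infty$.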


Hint. Apply Theorem 6.3. 


7


MARKOV CHAINS 


15. Transition Probabilities 


Consider a physical system with the following properties:

a) The system can occupy any of a finite or countably infinite number of states $\varepsilon_1, \varepsilon_2, \ldots$

b) Starting from some initial state at time $t = 0$, the system changes its state randomly at the times $t = 1, 2, \ldots$ Thus, if the random variable $\xi(t)$ is the state of the system at time $t$,¹ the evolution of the system in time is described by the consecutive transitions (or "steps")

$$\xi(0) \to \xi(1) \to \xi(2) \to \cdots$$

c) At time $t = 0$, the system occupies the state $\varepsilon_i$ with initial probability

$$p_i^0 = \mathbf{P}\{\xi(0) = \varepsilon_i\}, \qquad i = 1, 2, \ldots \tag{7.1}$$

d) Suppose the system is in the state $\varepsilon_i$ at any time $n$. Then the probability that the system goes into the state $\varepsilon_j$ at the next step is given by

$$p_{ij} = \mathbf{P}\{\xi(n+1) = \varepsilon_j \mid \xi(n) = \varepsilon_i\}, \qquad i, j = 1, 2, \ldots, \tag{7.2}$$

regardless of its behavior before the time $n$. The numbers $p_{ij}$, called the transition probabilities, do not depend on the time $n$.

¹ In calling $\xi(t)$ a random variable, we are tacitly assuming that the states $\varepsilon_1, \varepsilon_2, \ldots$ are numbers (random variables are numerical functions). This can always be achieved by the simple expedient of replacing $\varepsilon_1, \varepsilon_2, \ldots$ by the integers $1, 2, \ldots$ (see W. Feller, op. cit., p. 419).

A "random process" described by this model is called a Markov chain.² Now let

$$p_j(n) = \mathbf{P}\{\xi(n) = \varepsilon_j\} \tag{7.3}$$

be the probability that the system will be in the state $\varepsilon_j$ "after $n$ steps." To find $p_j(n)$, we argue as follows: After $n - 1$ steps, the system must be in one of the states $\varepsilon_k$, $k = 1, 2, \ldots$, i.e., the events $\{\xi(n-1) = \varepsilon_k\}$, $k = 1, 2, \ldots$ form a full set of mutually exclusive events in the sense of p. 26. Hence, by formula (3.6),

$$\mathbf{P}\{\xi(n) = \varepsilon_j\} = \sum_k \mathbf{P}\{\xi(n) = \varepsilon_j \mid \xi(n-1) = \varepsilon_k\}\, \mathbf{P}\{\xi(n-1) = \varepsilon_k\}. \tag{7.4}$$

Writing (7.4) in terms of the notation (7.1)-(7.3), we get the recursion formulas

$$p_j(0) = p_j^0, \qquad p_j(n) = \sum_k p_k(n-1)\, p_{kj}, \qquad n = 1, 2, \ldots \tag{7.5}$$


If the system is in a definite state $\varepsilon_i$ at time $t = 0$, the initial probability distribution reduces to

$$p_i^0 = 1, \qquad p_k^0 = 0 \quad \text{if } k \ne i. \tag{7.6}$$

The probability $p_j(n)$ is then the same as the probability

$$p_{ij}(n) = \mathbf{P}\{\xi(n) = \varepsilon_j \mid \xi(0) = \varepsilon_i\}, \qquad i, j = 1, 2, \ldots$$

that the system will go from state $\varepsilon_i$ to state $\varepsilon_j$ in $n$ steps. Hence, for the initial distribution (7.6), the formulas (7.5) become

$$p_{ij}(0) = \begin{cases} 1 & \text{if } j = i, \\ 0 & \text{if } j \ne i, \end{cases} \qquad p_{ij}(n) = \sum_k p_{ik}(n-1)\, p_{kj}, \qquad n = 1, 2, \ldots \tag{7.7}$$

The form of the sum in (7.7) suggests introducing the transition probability matrix

$$P = \|p_{ij}\| = \begin{Vmatrix} p_{11} & p_{12} & \cdots \\ p_{21} & p_{22} & \cdots \\ \cdots & \cdots & \cdots \end{Vmatrix}$$

² More exactly, a Markov chain with stationary transition probabilities, where we allude to the fact that the numbers (7.2) do not depend on $n$. For an abstract definition of a Markov chain, without reference to an underlying physical system, see W. Feller, op. cit., p. 374.

and the "$n$-step transition probability matrix"

$$P(n) = \|p_{ij}(n)\| = \begin{Vmatrix} p_{11}(n) & p_{12}(n) & \cdots \\ p_{21}(n) & p_{22}(n) & \cdots \\ \cdots & \cdots & \cdots \end{Vmatrix}.$$

Then, because of the rule for matrix multiplication,³ (7.7) implies

$$P(0) = I, \qquad P(1) = P, \qquad P(2) = P(1)P = P^2, \ldots,$$

where $I$ is the unit matrix (with ones along the main diagonal and zeros everywhere else). It follows that

$$P(n) = P^n, \qquad n = 1, 2, \ldots \tag{7.8}$$
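The recursion (7.5) and the matrix form (7.8) describe the same computation, which may be easier to see in code. The sketch below is only an illustration: the three-state matrix `P` is a hypothetical example that does not appear in the text, and the helper names `mat_mul`, `n_step_matrix` and `step` are assumptions.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step_matrix(P, n):
    """P(n) = P^n, built from P(0) = I and P(k) = P(k-1) P, as in (7.7)-(7.8)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

def step(dist, P):
    """One application of the recursion (7.5): p_j(n) = sum_k p_k(n-1) p_kj."""
    size = len(P)
    return [sum(dist[k] * P[k][j] for k in range(size)) for j in range(size)]

# Hypothetical 3-state chain; every row sums to 1.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
p0 = [1.0, 0.0, 0.0]           # definite initial state, as in (7.6)

dist = p0
for _ in range(4):
    dist = step(dist, P)        # p(4) obtained from the recursion (7.5)

P4 = n_step_matrix(P, 4)        # P(4) = P^4 by (7.8)
```

With the initial distribution (7.6) concentrated on the first state, `dist` coincides with the first row of `P4`, which is exactly the statement that $p_j(n) = p_{ij}(n)$ in that case.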


³ Suitably generalized to the case of infinite matrices, if there are infinitely many states $\varepsilon_1, \varepsilon_2, \ldots$

Example 1 (The book pile problem). Consider a pile of $m$ books lying on a desk. If the books are numbered from 1 to $m$, the order of the books from the top of the pile down is described by some permutation $(i_1, i_2, \ldots, i_m)$ of the integers $1, 2, \ldots, m$, where $i_1$ is the number of the book on top of the pile, $i_2$ the number of the next book down, etc., and $i_m$ is the number of the book at the bottom of the pile. Suppose each of the books is chosen with a definite probability, and then returned to the top of the pile. Let $p_k$ be the probability of choosing the $k$th book ($k = 1, 2, \ldots, m$), and suppose the book pile is in the state $(i_1, i_2, \ldots, i_m)$. Then, at the next step, the state either remains unchanged, which happens with probability $p_{i_1}$ when the top book (numbered $i_1$) is chosen, or else changes to one of the $m - 1$ states of the form $(i_k, i_1, \ldots)$, which happens with probability $p_{i_k}$ when a book other than the top book is chosen. Thus we are dealing with a Markov chain, with states described by the permutations $(i_1, i_2, \ldots, i_m)$ and the indicated transition probabilities.

For example, if $m = 2$, there are only two states $\varepsilon_1 = (1, 2)$ and $\varepsilon_2 = (2, 1)$, and the transition probabilities are

$$p_{11} = p_{21} = p_1, \qquad p_{12} = p_{22} = p_2.$$

The corresponding transition probability matrix is

$$P = \begin{Vmatrix} p_1 & p_2 \\ p_1 & p_2 \end{Vmatrix}.$$

The "two-step transition probabilities" are

$$p_{11}(2) = p_{21}(2) = p_1 p_1 + p_2 p_1 = p_1(p_1 + p_2) = p_1,$$
$$p_{12}(2) = p_{22}(2) = p_1 p_2 + p_2 p_2 = p_2(p_1 + p_2) = p_2.$$

Hence $P^2 = P$, and more generally $P^n = P$. Given any initial probability distribution $p_1^0, p_2^0$, we have

$$p_1(n) = p_1^0\, p_{11}(n) + p_2^0\, p_{21}(n) = p_1(p_1^0 + p_2^0) = p_1,$$
$$p_2(n) = p_1^0\, p_{12}(n) + p_2^0\, p_{22}(n) = p_2(p_1^0 + p_2^0) = p_2, \qquad n = 1, 2, \ldots$$
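The transition rule of the book pile can be written out mechanically for any $m$. The following sketch builds the matrix by listing all permutations; the function name `book_pile_matrix` and the choice probabilities 0.25 and 0.75 are assumptions made for illustration.

```python
from itertools import permutations

def book_pile_matrix(probs):
    """Transition matrix of the book pile chain.

    probs[k-1] is the probability of choosing book k. States are the
    permutations of 1..m, listed with the top of the pile first."""
    m = len(probs)
    states = list(permutations(range(1, m + 1)))
    index = {s: n for n, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        for pos, book in enumerate(s):
            # The chosen book is returned to the top of the pile.
            new_state = (book,) + s[:pos] + s[pos + 1:]
            P[index[s]][index[new_state]] += probs[book - 1]
    return states, P

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# m = 2 with illustrative probabilities p_1 = 0.25, p_2 = 0.75.
states2, P2 = book_pile_matrix([0.25, 0.75])
P2_squared = mat_mul(P2, P2)

# m = 3: six states, each row still sums to 1.
states3, P3 = book_pile_matrix([0.5, 0.25, 0.25])
```

For $m = 2$ both rows of `P2` equal $(p_1, p_2)$ and `P2_squared` equals `P2`, matching the calculation above.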


Example 2 (The optimal choice problem). Returning to Example 2, p. 28, concerning the choice of the best object among $m$ objects all of different quality, let $\varepsilon_k$ ($k = 1, 2, \ldots, m$) be the state characterized by the fact that the $k$th inspected object is the best of the first $k$ objects inspected, and let $\varepsilon_{m+1}$ be the state characterized by the fact that the best of all $m$ objects has already been examined and rejected. As the $m$ objects are examined one by one at random, there are various times at which the last object examined turns out to be better than all previous objects examined. Denote these times in order of occurrence by $t = 0, 1, \ldots, \nu$, with $t = 0$ corresponding to inspection of the first object and $t = \nu$ being the time at which the best of all $m$ objects is examined ($\nu = 0$ if the best object is examined first). Imagine a system with possible states $\varepsilon_1, \ldots, \varepsilon_m, \varepsilon_{m+1}$, and let $\xi(t)$ be the state of the system at the time $t$, so that in particular $\xi(0) = \varepsilon_1$. To make the "random process" $\xi(0) \to \xi(1) \to \xi(2) \to \cdots$ into a Markov chain, we must define $\xi(n)$ for $n > \nu$. This is done by the simple artifice of setting $\xi(n) = \varepsilon_{m+1}$ for all $n > \nu$.

The transition probabilities of this Markov chain are easily found. Obviously $p_{m+1,m+1} = 1$ and $p_{ij} = 0$ if $i \ge j$, $j \le m$. To calculate $p_{ij}$ for $i < j \le m$, we write (7.2) in the form

$$p_{ij} = \mathbf{P}(E_j \mid E_i) = \frac{\mathbf{P}(E_i E_j)}{\mathbf{P}(E_i)} \tag{7.9}$$

in terms of the events $E_i = \{\xi(n) = \varepsilon_i\}$ and $E_j = \{\xi(n+1) = \varepsilon_j\}$. Clearly, $\mathbf{P}(E_j)$ is the probability that the best object will occupy the last place in a randomly selected permutation of $j$ objects, all of different quality. Since the total number of distinct permutations of $j$ objects is $j!$, while the number of such permutations with a fixed element in the last ($j$th) place is $(j-1)!$, we have

$$\mathbf{P}(E_j) = \frac{(j-1)!}{j!} = \frac{1}{j}, \qquad j = 1, \ldots, m. \tag{7.10}$$

Similarly, $\mathbf{P}(E_i E_j)$ is the probability that the best object occupies the $j$th place, while a definite object (namely, the second best object) occupies the $i$th place. Clearly, there are $(j-2)!$ permutations of $j$ objects with fixed elements in two places, and hence

$$\mathbf{P}(E_i E_j) = \frac{(j-2)!}{j!} = \frac{1}{(j-1)j}, \qquad i < j \le m. \tag{7.11}$$

It follows from (7.9)-(7.11) that

$$p_{ij} = \frac{i}{(j-1)j}, \qquad i < j \le m.$$

As for the transition probabilities $p_{i,m+1}$, they have in effect already been calculated in Example 2, p. 28:

$$p_{i,m+1} = \frac{i}{m}, \qquad i = 1, \ldots, m.$$
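As a consistency check on these formulas, each row of the resulting matrix should sum to 1. A minimal sketch with exact rational arithmetic follows; the function name `optimal_choice_matrix` and the value $m = 5$ are assumptions for illustration.

```python
from fractions import Fraction

def optimal_choice_matrix(m):
    """Transition matrix of the optimal choice chain for m objects.

    States are numbered 1..m+1 as in the text; row and column 0 are unused
    padding so that P[i][j] matches the subscripts p_ij."""
    size = m + 2
    P = [[Fraction(0)] * size for _ in range(size)]
    for i in range(1, m + 1):
        for j in range(i + 1, m + 1):
            P[i][j] = Fraction(i, (j - 1) * j)   # p_ij = i / ((j-1) j), i < j <= m
        P[i][m + 1] = Fraction(i, m)             # p_{i,m+1} = i / m
    P[m + 1][m + 1] = Fraction(1)                # state m+1 is never left
    return P

m = 5
P = optimal_choice_matrix(m)
row_sums = [sum(P[i][1:]) for i in range(1, m + 2)]
```

The rows sum to 1 because $\sum_{j=i+1}^{m} \frac{i}{(j-1)j} = i\left(\frac{1}{i} - \frac{1}{m}\right) = 1 - \frac{i}{m}$.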


Example 3 (One-dimensional random walk). Consider a particle which moves randomly along the $x$-axis, coming to rest only at the points $x = \ldots, -2, -1, 0, 1, 2, \ldots$ with integral coordinates. Suppose the particle's motion is such that once at a point $i$, it jumps at the next step to either the point $i + 1$ or the point $i - 1$, with probabilities $p$ and $q = 1 - p$, respectively.⁴ Let $\xi(n)$ be the particle's position after $n$ steps. Then the sequence $\xi(0) \to \xi(1) \to \xi(2) \to \cdots$ is a Markov chain with transition probabilities

$$p_{ij} = \begin{cases} p & \text{if } j = i + 1, \\ q & \text{if } j = i - 1, \\ 0 & \text{otherwise.} \end{cases} \tag{7.12}$$

In another kind of one-dimensional random walk, the particle comes to rest only at the points $x = 0, 1, 2, \ldots$, jumping from the point $i$ to the point $i + 1$ with probability $p_i$ and returning to the origin with probability $q_i = 1 - p_i$. The corresponding Markov chain has transition probabilities

$$p_{ij} = \begin{cases} p_i & \text{if } j = i + 1, \\ q_i & \text{if } j = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{7.13}$$

⁴ Thus the particle's motion is "generated" by an infinite sequence of Bernoulli trials (cf. the example on pp. 63-65, where $p = q = \frac{1}{2}$).


16. Persistent and Transient States


Consider a Markov chain with states $\varepsilon_1, \varepsilon_2, \ldots$ and transition probabilities $p_{ij}$, $i, j = 1, 2, \ldots$ Suppose the system is initially in the state $\varepsilon_i$. Let

$$u_n = p_{ii}(n),$$

and let $v_n$ be the probability that the system returns to the initial state $\varepsilon_i$ for the first time after precisely $n$ steps. Then

$$u_n = u_0 v_n + u_1 v_{n-1} + \cdots + u_{n-1} v_1 + u_n v_0, \qquad n = 1, 2, \ldots, \tag{7.14}$$

where we set

$$u_0 = 1, \qquad v_0 = 0$$

by definition. To see this, let $B_k$ ($k = 1, \ldots, n$) be the event that "the system returns to $\varepsilon_i$ for the first time after $k$ steps," $B_{n+1}$ the event that "the system does not return at all to $\varepsilon_i$ during the first $n$ steps," and $A$ the event that "the system is in the initial state $\varepsilon_i$ after $n$ steps." Then the events $B_1, \ldots, B_n, B_{n+1}$ form a full set of mutually exclusive events, and hence, by the "total probability formula" (3.6), p. 26,

$$\mathbf{P}(A) = \sum_{k=1}^{n+1} \mathbf{P}(A \mid B_k)\, \mathbf{P}(B_k), \tag{7.15}$$

where clearly $\mathbf{P}(A \mid B_{n+1}) = 0$ and

$$\mathbf{P}(B_k) = v_k, \qquad \mathbf{P}(A \mid B_k) = u_{n-k}, \qquad k = 1, \ldots, n.$$

Substituting these values into (7.15), we get (7.14).
In terms of the generating functions⁵

$$U(z) = \sum_{k=0}^{\infty} u_k z^k, \qquad V(z) = \sum_{k=0}^{\infty} v_k z^k, \qquad |z| < 1,$$

we can write (7.14) in the form

$$U(z) - u_0 = U(z)V(z), \qquad u_0 = 1,$$

which implies

$$U(z) = \frac{1}{1 - V(z)}. \tag{7.16}$$

The quantity

$$v = \sum_{n=0}^{\infty} v_n$$

is the probability that the system sooner or later returns to the original state $\varepsilon_i$. The state $\varepsilon_i$ is said to be persistent if $v = 1$ and transient if $v < 1$.

⁵ Although the numbers $u_0, u_1, u_2, \ldots$ do not correspond to a probability distribution as on p. 70 (in fact, we will consider the case where $\sum_{k=0}^{\infty} u_k = \infty$), we continue to call $U(z)$ a "generating function." The convergence of the series $\sum_{k=0}^{\infty} u_k z^k$ for $|z| < 1$ follows by comparison with the geometric series, since $|u_k| \le 1$ for every $k$.

THEOREM 7.1. The state $\varepsilon_i$ is persistent if and only if

$$\sum_{n=0}^{\infty} u_n = \sum_{n=0}^{\infty} p_{ii}(n) = \infty. \tag{7.17}$$

Proof. To say that $\varepsilon_i$ is persistent means that

$$v = \sum_{n=0}^{\infty} v_n = \lim_{z \to 1} V(z) = 1,$$

or equivalently,

$$\lim_{z \to 1} U(z) = \lim_{z \to 1} \frac{1}{1 - V(z)} = \infty.$$

Suppose

$$\sum_{n=0}^{\infty} u_n < \infty. \tag{7.18}$$

Then, since the $u_n$ are all nonnegative,

$$\sum_{n=0}^{N} u_n \le \lim_{z \to 1} U(z) \le \sum_{n=0}^{\infty} u_n$$

for every $N$, and hence, taking the limit as $N \to \infty$, we have

$$\lim_{z \to 1} U(z) = \sum_{n=0}^{\infty} u_n.$$

In other words, $U(z)$ approaches a finite limit as $z \to 1$ if and only if (7.18) holds. Equivalently, $U(z) \to \infty$ as $z \to 1$, i.e., $\varepsilon_i$ is persistent, if and only if (7.17) holds. ∎
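The relation (7.14) can be solved step by step for the first-return probabilities $v_n$ once the $u_n$ are known. The sketch below does this for the random walk (7.12), taking $u_{2n} = C_{2n}^n p^n q^n$ and $u_{2n+1} = 0$ (the binomial probability of $n$ steps in each direction). The function names, the cutoff of 200 steps and the values $p = 0.7$, $p = 0.5$ are assumptions chosen for illustration.

```python
from math import comb

def first_return_probs(u):
    """Solve (7.14) for v_0, ..., v_n given u_0, ..., u_n with u_0 = 1.

    Since v_0 = 0 and u_0 = 1, (7.14) gives
    v_n = u_n - sum_{k=1}^{n-1} u_{n-k} v_k."""
    v = [0.0] * len(u)
    for n in range(1, len(u)):
        v[n] = u[n] - sum(u[n - k] * v[k] for k in range(1, n))
    return v

def walk_u(p, steps):
    """u_n = p_ii(n) for the walk (7.12): zero for odd n, binomial for even n."""
    q = 1.0 - p
    return [comb(n, n // 2) * (p * q) ** (n // 2) if n % 2 == 0 else 0.0
            for n in range(steps + 1)]

u_biased = walk_u(0.7, 200)
v_biased = first_return_probs(u_biased)

u_fair = walk_u(0.5, 200)
v_fair = first_return_probs(u_fair)

total_biased = sum(v_biased)   # stays well below 1
total_fair = sum(v_fair)       # creeps toward 1 as the cutoff grows
```

For $p = 0.7$ the partial sums of $v_n$ settle near $0.6$ while $\sum u_n$ stays bounded; for $p = 0.5$ the partial sums of $v_n$ approach 1 slowly while $\sum u_n$ keeps growing with the cutoff, which is the dichotomy expressed by Theorem 7.1.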


THEOREM 7.2. If the initial state $\varepsilon_i$ is persistent, then with probability 1 the system returns infinitely often to $\varepsilon_i$ as the number of steps $n \to \infty$. If $\varepsilon_i$ is transient, then with probability 1 the system returns to $\varepsilon_i$ only finitely often, i.e., after a certain number of steps the system never again returns to $\varepsilon_i$.

Proof. Suppose the system first returns to $\varepsilon_i$ after $\nu_1$ steps, returns a second time to $\varepsilon_i$ after $\nu_2$ steps, and so on. If there are fewer than $k$ returns to $\varepsilon_i$ as $n \to \infty$, we set $\nu_k = \infty$. Then the event $\{\nu_k < \infty\}$ means that there are at least $k$ returns to $\varepsilon_i$, and the probability of the system returning to $\varepsilon_i$ at least once is just

$$\mathbf{P}\{\nu_1 < \infty\} = v.$$

If the event $\{\nu_1 < \infty\}$ occurs, the system returns to its initial state $\varepsilon_i$ after $\nu_1$ steps, and its subsequent behavior is the same as if it just started its motion in $\varepsilon_i$. It follows that

$$\mathbf{P}\{\nu_2 < \infty \mid \nu_1 < \infty\} = v.$$

Clearly $\nu_1 = \infty$ implies $\nu_2 = \infty$, and hence $\nu_2 < \infty$ implies $\nu_1 < \infty$. Therefore

$$\mathbf{P}\{\nu_2 < \infty\} = \mathbf{P}\{\nu_2 < \infty \mid \nu_1 < \infty\}\, \mathbf{P}\{\nu_1 < \infty\} = v^2,$$

and similarly,

$$\mathbf{P}\{\nu_k < \infty \mid \nu_{k-1} < \infty\} = v, \qquad \mathbf{P}\{\nu_k < \infty\} = v^k.$$

If $\varepsilon_i$ is transient, then $v < 1$ and hence

$$\sum_{k=1}^{\infty} \mathbf{P}\{\nu_k < \infty\} = \sum_{k=1}^{\infty} v^k < \infty.$$

Therefore, by the first Borel-Cantelli lemma (Theorem 2.5, p. 21), with probability 1 only finitely many of the events $\{\nu_k < \infty\}$ occur, i.e., with probability 1 the system returns to the state $\varepsilon_i$ only finitely often. This proves the second assertion in the statement of the theorem.

On the other hand, if $\varepsilon_i$ is persistent, then $v = 1$, which implies

$$\mathbf{P}\{\nu_k < \infty\} = 1$$

for every $k$. Let $\kappa$ be the number of times the system returns to its initial state $\varepsilon_i$ as $n \to \infty$. Then obviously the events $\{\kappa \ge k\}$ and $\{\nu_k < \infty\}$ are equivalent, so that if $\mathbf{P}\{\nu_k < \infty\} = 1$ for every $k$, then $\kappa$ exceeds any preassigned integer $k$ with probability 1. But then

$$\mathbf{P}\{\kappa = \infty\} = 1,$$

which proves the first assertion. ∎
A state $\varepsilon_j$ is said to be accessible from a state $\varepsilon_i$ if the probability of the system going from $\varepsilon_i$ to $\varepsilon_j$ in some number of steps is positive, i.e., if $p_{ij}(M) > 0$ for some $M$.


THEOREM 7.3. If a state $\varepsilon_j$ is accessible from a persistent state $\varepsilon_i$, then $\varepsilon_i$ is in turn accessible from $\varepsilon_j$ and $\varepsilon_j$ is itself persistent.

Proof. Suppose $\varepsilon_i$ is not accessible from $\varepsilon_j$. Then the system will go from $\varepsilon_i$ to $\varepsilon_j$ with positive probability $p_{ij}(M) = \alpha > 0$ for some number of steps $M$, after which the system cannot return to $\varepsilon_i$. But then the probability of the system eventually returning to $\varepsilon_i$ cannot exceed $1 - \alpha$, contrary to the assumption that $\varepsilon_i$ is persistent. Hence $\varepsilon_i$ must be accessible from $\varepsilon_j$, i.e., $p_{ji}(N) = \beta > 0$ for some $N$. It follows from (7.8) that

$$P(n + M + N) = P(M)P(n)P(N) = P(N)P(n)P(M),$$

and hence

$$p_{ii}(n + M + N) \ge p_{ij}(M)\, p_{jj}(n)\, p_{ji}(N) = \alpha\beta\, p_{jj}(n),$$
$$p_{jj}(n + M + N) \ge p_{ji}(N)\, p_{ii}(n)\, p_{ij}(M) = \alpha\beta\, p_{ii}(n).$$

These inequalities show that the series

$$\sum_{n=0}^{\infty} p_{ii}(n), \qquad \sum_{n=0}^{\infty} p_{jj}(n)$$

either both converge or both diverge. But

$$\sum_{n=0}^{\infty} p_{ii}(n) = \infty$$

by Theorem 7.1, since $\varepsilon_i$ is persistent. Therefore

$$\sum_{n=0}^{\infty} p_{jj}(n) = \infty,$$

i.e., $\varepsilon_j$ is also persistent (again by Theorem 7.1). ∎


COROLLARY. If a Markov chain has only a finite number of states, each accessible from every other state, then the states are all persistent.

Proof. Since there are only a finite number of states, the system must return to at least one of them infinitely often as $n \to \infty$. Hence at least one of the states, say $\varepsilon_i$, is persistent. But all the other states are accessible from $\varepsilon_i$. It follows from Theorem 7.3 that all the states are persistent. ∎


Example 1. In the book pile problem (Example 1, p. 85), if every book is chosen with positive probability, i.e., if $p_i > 0$ for all $i = 1, \ldots, m$, then obviously every state is accessible from every other state. In this case, all $m!$ distinct states $(i_1, \ldots, i_m)$ are persistent. If $p_i = 0$ for some $i$, then all states of the form $(i_1, \ldots, i_m)$ where $i_1 = i$ (the $i$th book lies on top of the pile) are transient, since at the very first step a book with a number $j$ different from $i$ will be chosen, and then the book numbered $i$, which can never be chosen from the pile, will steadily work its way downward.

Example 2. In the optimal choice problem (Example 2, p. 86), it is obvious that after no more than $m$ steps ($m$ is the total number of objects), the system will arrive at the state $\varepsilon_{m+1}$, where it will remain forever. Hence all the states except $\varepsilon_{m+1}$ are transient.


Example 3. Consider the one-dimensional random walk with transition probabilities (7.12). Clearly, every state (i.e., every position of the particle) is accessible from every other state, and moreover⁶

$$p_{ii}(k) = \begin{cases} 0 & \text{if } k = 2n + 1, \\ C_{2n}^{n}\, p^n q^n & \text{if } k = 2n. \end{cases}$$

Using Stirling's formula (see p. 10), we have

$$C_{2n}^{n}\, p^n q^n = \frac{(2n)!}{(n!)^2}\, p^n q^n \sim \frac{\sqrt{4\pi n}\,(2n)^{2n} e^{-2n}}{\left(\sqrt{2\pi n}\; n^n e^{-n}\right)^2}\, p^n q^n = \frac{1}{\sqrt{\pi n}}\,(4pq)^n$$

for large $n$, where

$$4pq = (p + q)^2 - (p - q)^2 = 1 - (p - q)^2 \le 1$$

(the equality holds only for $p = q = \frac{1}{2}$). Therefore

$$p_{ii}(2n) \sim \frac{1}{\sqrt{\pi n}}\,(4pq)^n$$

for large $n$, and hence the series

$$\sum_{n=0}^{\infty} p_{ii}(2n), \qquad \sum_{n=1}^{\infty} \frac{1}{\sqrt{\pi n}}\,(4pq)^n$$

either both converge or both diverge. Suppose $p \ne q$, so that $4pq < 1$. Then

$$\sum_{n=0}^{\infty} p_{ii}(2n) < \infty,$$

and hence every state is transient. It is intuitively clear that if $p > q$ (say), then the particle will gradually work its way out along the $x$-axis in the positive direction, and sooner or later permanently abandon any given state $i$. However, if $p = q = \frac{1}{2}$, we have

$$\sum_{n=0}^{\infty} p_{ii}(2n) = \infty,$$

and the particle will return to each state infinitely often, a fact apparent from the symmetry of the problem in this case.

⁶ Cf. formula (5.2), p. 55.
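The Stirling estimate and the resulting convergence behavior are easy to inspect numerically. The following sketch compares the exact value $C_{2n}^n p^n q^n$ with $(4pq)^n / \sqrt{\pi n}$ and forms partial sums; the function names and the particular values of $n$ and $p$ are assumptions for illustration.

```python
from math import comb, pi, sqrt

def exact_return_prob(n, p):
    """p_ii(2n) = C(2n, n) p^n q^n for the walk (7.12)."""
    q = 1.0 - p
    return comb(2 * n, n) * (p * q) ** n

def stirling_estimate(n, p):
    """Asymptotic form (4pq)^n / sqrt(pi n), valid for large n."""
    q = 1.0 - p
    return (4.0 * p * q) ** n / sqrt(pi * n)

def partial_sum(p, terms):
    """Sum of p_ii(2n) for n = 0, ..., terms."""
    return sum(exact_return_prob(n, p) for n in range(terms + 1))

ratio_fair = exact_return_prob(200, 0.5) / stirling_estimate(200, 0.5)
ratio_biased = exact_return_prob(200, 0.7) / stirling_estimate(200, 0.7)

sum_biased_100 = partial_sum(0.7, 100)
sum_biased_300 = partial_sum(0.7, 300)   # barely changes: the series converges
sum_fair_100 = partial_sum(0.5, 100)
sum_fair_300 = partial_sum(0.5, 300)     # keeps growing: the series diverges
```

Both ratios are close to 1 at $n = 200$. For $p = 0.7$ the partial sums are already stable at 100 terms, whereas for $p = 0.5$ they grow roughly like $2\sqrt{n/\pi}$.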


Example 4. Next consider the one-dimensional random walk with transition probabilities (7.13). Obviously, if $0 < p_i < 1$ for all $i = 0, 1, \ldots$, every state is accessible from every other state, and hence the states are either all persistent or all transient. Suppose the system is initially in the state $i = 0$. Then the probability that it does not return to the state $i = 0$ after $n$ steps equals the product $p_0 p_1 \cdots p_{n-1}$, the probability of the system making the consecutive transitions $0 \to 1 \to \cdots \to n$. It is easy to see that the probability that the system never returns to its initial state $i = 0$ as $n \to \infty$ equals the infinite product

$$\prod_{n=0}^{\infty} p_n = \lim_{n \to \infty} p_0 p_1 \cdots p_n.$$

If this infinite product converges to zero, i.e., if

$$\lim_{n \to \infty} p_0 p_1 \cdots p_n = 0,$$

then the state $i = 0$ is persistent, and hence so are all the other states. Otherwise, the probability of return to the initial state is

$$v = 1 - \lim_{n \to \infty} p_0 p_1 \cdots p_n < 1. \tag{7.19}$$

Then the state $i = 0$ is transient, and hence so are all the other states.

We can arrive at the same result somewhat differently by direct calculation of the probability $v_n$ that the particle first returns to its initial state $i = 0$ in precisely $n$ steps. Obviously, $v_n$ is just the probability of the particle making the consecutive transitions $0 \to 1 \to \cdots \to n - 1$ in the first $n - 1$ steps and then returning to the state $i = 0$ in the $n$th step. Therefore, since the transition $i - 1 \to i$ has probability $p_{i-1}$,

$$v_1 = 1 - p_0,$$
$$v_n = p_0 p_1 \cdots p_{n-2}(1 - p_{n-1}), \qquad n = 2, 3, \ldots$$

By definition, the probability of eventually returning to the initial state $i = 0$ is

$$v = \sum_{n=0}^{\infty} v_n.$$

Therefore

$$v = 1 - p_0 + p_0(1 - p_1) + p_0 p_1(1 - p_2) + \cdots = 1 - \lim_{n \to \infty} p_0 p_1 \cdots p_n,$$

in keeping with (7.19).
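The telescoping sum behind (7.19) can be followed in a short computation. The sketch below accumulates $v_1, \ldots, v_n$ together with the product $p_0 p_1 \cdots p_{n-1}$; the two choices of $p_i$ are hypothetical examples, picked because their products have simple closed forms ($1/(n+1)$ and $(n+2)/(2(n+1))$ respectively).

```python
def return_probs_7_13(p, n_max):
    """First-return probabilities v_1..v_n_max for the walk (7.13).

    `p` is a function giving p_i, the probability of the step i -> i+1.
    Returns the list [v_1, ..., v_n_max] and the product p_0 ... p_{n_max-1},
    i.e. the probability of no return during the first n_max steps."""
    v = []
    no_return = 1.0                                  # empty product before the first step
    for n in range(1, n_max + 1):
        v.append(no_return * (1.0 - p(n - 1)))       # reach n-1, then jump back to 0
        no_return *= p(n - 1)                        # reach n without returning
    return v, no_return

# Product tends to 0: the state 0 is persistent.
v_persistent, left_persistent = return_probs_7_13(lambda i: (i + 1) / (i + 2), 999)

# Product tends to 1/2: the state 0 is transient with v = 1/2.
v_transient, left_transient = return_probs_7_13(lambda i: 1.0 - 1.0 / (i + 2) ** 2, 999)
```

In both cases `sum(v)` plus the remaining product equals 1 up to rounding, which is the finite version of the identity $v = 1 - \lim p_0 p_1 \cdots p_n$.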


17. Limiting Probabilities. Stationary Distributions 


As before, let $p_j(n)$ be the probability of the system occupying the state $\varepsilon_j$ after $n$ steps. Then, under certain conditions, the numbers $p_j(n)$, $j = 1, 2, \ldots$ approach definite limits as $n \to \infty$:

THEOREM 7.4. Given a Markov chain with a finite number of states $\varepsilon_1, \ldots, \varepsilon_m$, each accessible from every other state, suppose

$$\min_{i,j} p_{ij}(N) = \delta > 0 \tag{7.20}$$

for some $N$.⁷ Then

$$\lim_{n \to \infty} p_j(n) = p_j^*,$$

where the numbers $p_j^*$, $j = 1, \ldots, m$, called the limiting probabilities,⁸ do not depend on the initial probability distribution and satisfy the inequalities

$$\max_i |p_{ij}(n) - p_j^*| \le C e^{-Dn}, \qquad |p_j(n) - p_j^*| \le C e^{-Dn} \tag{7.21}$$

for suitable positive constants $C$ and $D$.

⁷ In other words, suppose the probability of the system going from any state $\varepsilon_i$ to any other state $\varepsilon_j$ in some (fixed) number of steps $N$ is positive.

⁸ Clearly, the numbers $p_j^*$ are nonnegative and have the sum 1 (why?). Hence they are candidates for the probabilities of a discrete probability distribution, as implicit in the term "limiting probabilities."

Proof. Let

$$r_j(n) = \min_i p_{ij}(n), \qquad R_j(n) = \max_i p_{ij}(n).$$

Then

$$r_j(n+1) = \min_i p_{ij}(n+1) = \min_i \sum_{k=1}^{m} p_{ik}\, p_{kj}(n) \ge \min_i \sum_{k=1}^{m} p_{ik}\, r_j(n) = r_j(n),$$
$$R_j(n+1) = \max_i p_{ij}(n+1) = \max_i \sum_{k=1}^{m} p_{ik}\, p_{kj}(n) \le \max_i \sum_{k=1}^{m} p_{ik}\, R_j(n) = R_j(n),$$

and hence

$$r_j(1) \le r_j(2) \le \cdots \le r_j(n) \le \cdots \le R_j(n) \le \cdots \le R_j(2) \le R_j(1).$$

Let $N$ be the same as in (7.20). Then, for arbitrary states $\varepsilon_\alpha$ and $\varepsilon_\beta$,

$$\sum_{k=1}^{m} p_{\alpha k}(N) = \sum_{k=1}^{m} p_{\beta k}(N) = 1.$$

Therefore

$$\sum_{k=1}^{m} p_{\alpha k}(N) - \sum_{k=1}^{m} p_{\beta k}(N) = {\sum}^{+} [p_{\alpha k}(N) - p_{\beta k}(N)] + {\sum}^{-} [p_{\alpha k}(N) - p_{\beta k}(N)] = 0,$$

where the sum $\sum^{+}$ ranges over all $k$ such that $p_{\alpha k}(N) - p_{\beta k}(N) > 0$ and $\sum^{-}$ ranges over all $k$ such that $p_{\alpha k}(N) - p_{\beta k}(N) \le 0$. Clearly, (7.20) implies

$$\max_{\alpha, \beta} {\sum}^{+} [p_{\alpha k}(N) - p_{\beta k}(N)] = d < 1$$

for some positive number $d$.

Next we estimate the differences $R_j(N) - r_j(N)$ and $R_j(n + N) - r_j(n + N)$:

$$R_j(N) - r_j(N) = \max_\alpha p_{\alpha j}(N) - \min_\beta p_{\beta j}(N) = \max_{\alpha, \beta} [p_{\alpha j}(N) - p_{\beta j}(N)] \le \max_{\alpha, \beta} {\sum}^{+} [p_{\alpha k}(N) - p_{\beta k}(N)] = d,$$

$$\begin{aligned}
R_j(n + N) - r_j(n + N) &= \max_{\alpha, \beta} [p_{\alpha j}(n + N) - p_{\beta j}(n + N)] \\
&= \max_{\alpha, \beta} \sum_{k=1}^{m} [p_{\alpha k}(N) - p_{\beta k}(N)]\, p_{kj}(n) \\
&\le \max_{\alpha, \beta} \left\{ {\sum}^{+} [p_{\alpha k}(N) - p_{\beta k}(N)]\, R_j(n) + {\sum}^{-} [p_{\alpha k}(N) - p_{\beta k}(N)]\, r_j(n) \right\} \\
&= \max_{\alpha, \beta} {\sum}^{+} [p_{\alpha k}(N) - p_{\beta k}(N)]\, [R_j(n) - r_j(n)] \\
&= d\,[R_j(n) - r_j(n)].
\end{aligned}$$

It follows that

$$R_j(kN) - r_j(kN) \le d^k, \qquad k = 1, 2, \ldots \tag{7.22}$$

But, as already noted, the sequence $r_j(n)$, $n = 1, 2, \ldots$ is nondecreasing while the sequence $R_j(n)$, $n = 1, 2, \ldots$ is nonincreasing, and moreover $r_j(n) \le R_j(n)$. Hence (7.22) shows that both sequences have the same limit

$$p_j^* = \lim_{n \to \infty} r_j(n) = \lim_{n \to \infty} R_j(n).$$

Moreover, it is clear that

$$|p_{ij}(n) - p_j^*| \le R_j(n) - r_j(n) \le d^{\,n/N - 1}, \qquad i, j = 1, \ldots, m. \tag{7.23}$$

Therefore, given any initial distribution $p_i^0$, $i = 1, \ldots, m$, we have

$$\begin{aligned}
|p_j(n) - p_j^*| &= \left| \sum_{i=1}^{m} p_i^0\, p_{ij}(n) - p_j^* \right| = \left| \sum_{i=1}^{m} p_i^0\, [p_{ij}(n) - p_j^*] \right| \\
&\le \sum_{i=1}^{m} p_i^0\, [R_j(n) - r_j(n)] = R_j(n) - r_j(n) \le d^{\,n/N - 1}.
\end{aligned} \tag{7.24}$$

But then

$$\lim_{n \to \infty} |p_j(n) - p_j^*| = 0,$$

i.e.,

$$\lim_{n \to \infty} p_j(n) = p_j^*$$

independently of the initial distribution, as asserted. Choosing

$$C = \frac{1}{d}, \qquad D = -\frac{1}{N} \ln d$$

in (7.23) and (7.24), we get (7.21). ∎
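The quantities $r_j(n)$ and $R_j(n)$ used in the proof can be tracked directly for a small chain. In the sketch below the three-state matrix is a hypothetical example (it has zero entries, but all entries of $P^2$ are positive, so (7.20) holds with $N = 2$); the helper names are assumptions.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def spread(Pn):
    """max over j of R_j(n) - r_j(n): the largest gap between the maximum
    and the minimum entry of a column of P(n)."""
    m = len(Pn)
    return max(max(Pn[i][j] for i in range(m)) - min(Pn[i][j] for i in range(m))
               for j in range(m))

# Hypothetical 3-state chain; every row sums to 1.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

Pn = P
spreads = []                     # spreads[n-1] = max_j [R_j(n) - r_j(n)]
for n in range(1, 21):
    spreads.append(spread(Pn))
    Pn = mat_mul(Pn, P)          # after the loop, Pn = P^21

limit_row = Pn[0]                # every row of P^n approaches the same limit
```

The spreads never increase and shrink geometrically, so all rows of $P^n$ converge to one common row, whose entries play the role of the limiting probabilities $p_j^*$.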


COROLLARY. The limiting probabilities $p_j^*$, $j = 1, \ldots, m$ are a solution of the system of linear equations

$$p_j^* = \sum_{i=1}^{m} p_i^*\, p_{ij}, \qquad j = 1, \ldots, m. \tag{7.25}$$

Proof. According to (7.5),

$$p_j(n) = \sum_{i=1}^{m} p_i(n-1)\, p_{ij}.$$

But this becomes (7.25), after taking the limit as $n \to \infty$. ∎
Remark. Given an arbitrary Markov chain with states $\varepsilon_1, \varepsilon_2, \ldots$, let $p_i^0$, $i = 1, 2, \ldots$ be numbers such that

$$p_i^0 \ge 0, \qquad \sum_i p_i^0 = 1$$

and

$$p_j^0 = \sum_i p_i^0\, p_{ij}, \qquad j = 1, 2, \ldots \tag{7.26}$$

Choosing $p_i^0$, $i = 1, 2, \ldots$ as the initial probability distribution, we calculate the probability $p_j(n)$ of finding the system in the state $\varepsilon_j$ after $n$ steps, obtaining

$$p_j(1) = \sum_i p_i^0\, p_{ij} = p_j^0,$$
$$p_j(2) = \sum_i p_i(1)\, p_{ij} = \sum_i p_i^0\, p_{ij} = p_j^0,$$
$$\cdots$$

It follows that

$$p_j(n) = p_j^0, \qquad j = 1, 2, \ldots \tag{7.27}$$

for all $n = 0, 1, 2, \ldots$ [$p_j(0) = p_j^0$ trivially], i.e., the probabilities $p_j(n)$, $j = 1, 2, \ldots$ remain unchanged as the system evolves in time.


A Markov chain is said to be stationary if the probabilities $p_j(n)$, $j = 1, 2, \ldots$ remain unchanged for all $n = 0, 1, 2, \ldots$, and then the corresponding probability distribution with probabilities (7.27) is also said to be stationary. It follows from the corollary and the remark that a probability distribution $p_j^0$, $j = 1, 2, \ldots$ is stationary if and only if it satisfies the system of equations (7.26). Moreover, if the limiting probabilities

$$p_j^* = \lim_{n \to \infty} p_j(n) \tag{7.28}$$

are the same for every initial distribution, then there is a unique stationary distribution with probabilities

$$p_j^0 = p_j^*, \qquad j = 1, 2, \ldots$$

Hence Theorem 7.4 and its corollary can be paraphrased as follows: Subject to the condition (7.20), the limiting probabilities (7.28) exist and are the unique solution of the system of linear equations (7.25) satisfying the extra conditions

$$p_j^* \ge 0, \qquad \sum_{j=1}^{m} p_j^* = 1.$$

Moreover, they form a stationary distribution for the given Markov chain.
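For a finite chain satisfying (7.20), the paraphrase above amounts to solving a small linear system. The following is a minimal sketch using exact rational arithmetic and plain Gauss-Jordan elimination; the function name `stationary_distribution` and the three-state matrix are assumptions, and the code presumes that the solution is unique (as it is under (7.20)).

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve (7.25) together with the normalization sum_j p_j = 1.

    The unknowns are x_0, ..., x_{m-1}. Equation (7.25) for column j reads
    sum_i x_i (p_ij - delta_ij) = 0; one of these m equations is redundant,
    so the last one is replaced by the normalization condition."""
    m = len(P)
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(m)] + [Fraction(0)]
         for j in range(m - 1)]
    A.append([Fraction(1)] * m + [Fraction(1)])
    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(m):
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][m] for r in range(m)]

# Hypothetical 3-state chain with exact entries; every row sums to 1.
half, quarter = Fraction(1, 2), Fraction(1, 4)
P = [[half, half, 0],
     [quarter, half, quarter],
     [0, half, half]]

p_star = stationary_distribution(P)
# Check (7.25): applying one step of the chain leaves p_star unchanged.
after_one_step = [sum(p_star[i] * P[i][j] for i in range(3)) for j in range(3)]
```

For this matrix the solution is $(\frac{1}{4}, \frac{1}{2}, \frac{1}{4})$, and `after_one_step` reproduces it, illustrating that the limiting probabilities form a stationary distribution.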


Example 1. In the book pile problem, it will be recalled from p. 86 that when $m = 2$, the stationary distribution

$$p_1(n) = p_1, \qquad p_2(n) = p_2$$

is established at the very first step. In the case of arbitrary $m$, let $p_{(i_1, \ldots, i_m)(j_1, \ldots, j_m)}$ denote the probability of the transition from the state $(i_1, \ldots, i_m)$ to the state $(j_1, \ldots, j_m)$, and assume that the probabilities $p_1, \ldots, p_m$ are all positive. Then, as shown on p. 85,

$$p_{(i_1, \ldots, i_m)(j_1, \ldots, j_m)} = \begin{cases} p_{i_k} & \text{if } (j_1, \ldots, j_m) = (i_k, i_1, \ldots), \\ 0 & \text{otherwise,} \end{cases}$$

where the permutation $(i_k, i_1, \ldots)$ is obtained from $(i_1, \ldots, i_m)$ by choosing some $i_k$ and moving it into the first position. The limiting probabilities $p^*_{(j_1, \ldots, j_m)}$ are the solution of the system of linear equations

$$p^*_{(j_1, \ldots, j_m)} = p_{j_1} \sum_{(j_1', \ldots, j_m')} p^*_{(j_1', \ldots, j_m')}, \tag{7.29}$$

where $(j_1', \ldots, j_m')$ ranges over the $m$ permutations

$$(j_1, j_2, j_3, \ldots, j_m),\ (j_2, j_1, j_3, \ldots, j_m),\ \ldots,\ (j_2, j_3, \ldots, j_m, j_1)$$

which give $(j_1, \ldots, j_m)$ when $j_1$ is moved into the first position.

After a sufficiently large number of steps, a stationary distribution will be virtually established, i.e., the book pile will occupy the states $(i_1, \ldots, i_m)$ with virtually unchanging probabilities $p^*_{(i_1, \ldots, i_m)}$. Clearly, the probability of finding the $i$th book on top of the pile is then

$$p_i^* = \sum_{i_2, \ldots, i_m} p^*_{(i, i_2, \ldots, i_m)},$$

and hence, by (7.29),

$$p_i^* = p_i \sum_{i_2, \ldots, i_m}\ \sum_{(i_1', \ldots, i_m')} p^*_{(i_1', \ldots, i_m')},$$

where $(i_1', \ldots, i_m')$ ranges over the $m$ permutations

$$(i, i_2, i_3, \ldots, i_m),\ (i_2, i, i_3, \ldots, i_m),\ \ldots,\ (i_2, i_3, \ldots, i_m, i)$$

which give $(i, i_2, \ldots, i_m)$ when $i$ is moved into the first position. But then

$$p_i^* = p_i \sum_{(i_1, \ldots, i_m)} p^*_{(i_1, \ldots, i_m)} = p_i, \qquad i = 1, \ldots, m,$$

i.e., the limiting probability $p_i^*$ of finding the $i$th book on top of the pile is just the probability $p_i$ with which the $i$th book is chosen. Thus, the more often a book is chosen, the greater the probability of its ending up on top of the pile (which is hardly surprising!).
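The conclusion of this example can be observed by iterating the chain for a small pile. The sketch below uses $m = 3$ with hypothetical choice probabilities $0.5, 0.3, 0.2$ and an arbitrary definite initial order; the function names and the number of steps are assumptions.

```python
from itertools import permutations

def book_pile_chain(probs):
    """States (permutations, top of pile first) and transition matrix of the
    book pile chain; probs[k-1] is the probability of choosing book k."""
    m = len(probs)
    states = list(permutations(range(1, m + 1)))
    index = {s: n for n, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        for pos, book in enumerate(s):
            moved = (book,) + s[:pos] + s[pos + 1:]   # chosen book goes on top
            P[index[s]][index[moved]] += probs[book - 1]
    return states, P

def step(dist, P):
    """One application of the recursion (7.5)."""
    n = len(P)
    return [sum(dist[k] * P[k][j] for k in range(n)) for j in range(n)]

probs = [0.5, 0.3, 0.2]
states, P = book_pile_chain(probs)

# Start from the definite order (3, 2, 1) and iterate.
dist = [1.0 if s == (3, 2, 1) else 0.0 for s in states]
for _ in range(200):
    dist = step(dist, P)

# Probability that each book lies on top of the pile.
top = [sum(d for s, d in zip(states, dist) if s[0] == book) for book in (1, 2, 3)]
```

The list `top` agrees with `probs`, in line with $p_i^* = p_i$, while the probabilities of the individual orders in `dist` are not simply products of the $p_i$.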


Example 2. Consider the one-dimensional random walk with transition probabilities (7.12). If $p \ne q$, then the particle gradually moves further and further away from the origin, in the positive direction if $p > q$ and in the negative direction if $p < q$. If $p = q$, the particle will return infinitely often to each state, but for any fixed $j$, the probability $p_j(n)$ of the particle being at the point $j$ approaches 0 as $n \to \infty$ (why?). Hence, in any case,

$$\lim_{n \to \infty} p_j(n) = p_j^* = 0$$

for every $j$, but the numbers $p_j^*$, $j = 1, 2, \ldots$ cannot be interpreted as the limiting probabilities, since they are all zero. In particular, there is no stationary distribution.

Example 3. Finally, consider the one-dimensional random walk with transition probabilities (7.13). Suppose

$$\lim_{n \to \infty} p_0 p_1 \cdots p_n = 1 - v > 0, \tag{7.30}$$

so that the states are all transient (see p. 92). Then as $n \to \infty$, the particle "moves off to infinity" in the positive direction with probability 1, and there is obviously no stationary distribution. If there is a stationary distribution, it must satisfy the system of equations (7.26), which in the present case take the form

$$p_j^0 = p_{j-1}^0\, p_{j-1}, \qquad j = 1, 2, \ldots \tag{7.31}$$

It follows from (7.31) that

$$p_1^0 = p_0^0\, p_0, \quad p_2^0 = p_0^0\, p_0 p_1, \quad \ldots, \quad p_n^0 = p_0^0\, p_0 p_1 \cdots p_{n-1}, \quad \ldots$$

Clearly a stationary distribution exists if and only if the series

$$1 + \sum_{n=0}^{\infty} p_0 p_1 \cdots p_n = 1 + p_0 + p_0 p_1 + \cdots \tag{7.32}$$

converges.⁹ The stationary distribution is then

$$p_0^0 = \frac{1}{1 + p_0 + p_0 p_1 + \cdots}, \qquad p_n^0 = \frac{p_0 p_1 \cdots p_{n-1}}{1 + p_0 + p_0 p_1 + \cdots}, \qquad n = 1, 2, \ldots$$

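The formulas of Example 3 can be evaluated for a concrete choice of the $p_i$. The sketch below assumes the constant value $p_i = \frac{1}{2}$ (a hypothetical case in which the series (7.32) converges to 2) and truncates the series after finitely many terms; the function name and the cutoff are assumptions.

```python
def stationary_7_13(p, n_terms):
    """Approximate stationary probabilities p_0^0, ..., p_{n_terms-1}^0 of the
    walk (7.13), truncating the series (7.32) after n_terms terms.

    `p` is a function giving p_i, the probability of the step i -> i+1."""
    weights = [1.0]                            # 1, p_0, p_0 p_1, ...
    for n in range(1, n_terms):
        weights.append(weights[-1] * p(n - 1))
    total = sum(weights)                       # partial sum of (7.32)
    return [w / total for w in weights]

def p_const(i):
    """Hypothetical choice p_i = 1/2 for every i."""
    return 0.5

dist = stationary_7_13(p_const, 60)

# Equation (7.26) for j = 0: p_0^0 = sum_i p_i^0 q_i, with q_i = 1 - p_i.
return_mass = sum(d * (1.0 - p_const(i)) for i, d in enumerate(dist))
```

Here `dist[n]` is close to $(\frac{1}{2})^{n+1}$, each entry satisfies (7.31), and `return_mass` reproduces `dist[0]`, which is the remaining equation of the system (7.26).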

PROBLEMS 


1. A number from 1 to $m$ is chosen at random, at each of the times $t = 1, 2, \ldots$ A system is said to be in the state $\varepsilon_0$ if no number has yet been chosen, and in the state $\varepsilon_i$ if the largest number so far chosen is $i$. Show that the random process described by this model is a Markov chain. Find the corresponding transition probabilities $p_{ij}$ ($i, j = 0, 1, \ldots, m$).

Ans. $p_{ii} = \dfrac{i}{m}$, $p_{ij} = 0$ if $i > j$, $p_{ij} = \dfrac{1}{m}$ if $i < j$.

2. In the preceding problem, which states are persistent and which transient?

3. Suppose $m = 4$ in Problem 1. Find the matrix $P(2) = \|p_{ij}(2)\|$, where $p_{ij}(2)$ is the probability that the system will go from state $\varepsilon_i$ to state $\varepsilon_j$ in 2 steps.

⁹ Note that (7.32) automatically diverges if (7.30) holds.

4. An urn contains a total of $N$ balls, some black and some white. Samples are drawn from the urn, $m$ balls at a time ($m < N$). After drawing each sample, the black balls are returned to the urn, while the white balls are replaced by black balls and then returned to the urn. If the number of white balls in the urn is $i$, we say that the "system" is in the state $\varepsilon_i$. Prove that the random process described by this model is a Markov chain (imagine that samples are drawn at the times $t = 1, 2, \ldots$ and that the system has some initial probability distribution). Find the corresponding transition probabilities $p_{ij}$ ($i, j = 0, 1, \ldots, N$). Which states are persistent and which transient?

Ans. $p_{ij} = 0$ if $i < j$ or if $i \ge j$, $j > N - m$,

$$p_{ij} = \frac{C_i^{\,i-j}\, C_{N-i}^{\,m-i+j}}{C_N^{\,m}} \quad \text{if } i \ge j,\ j \le N - m.$$

The state $\varepsilon_0$ is persistent, but the others are transient.

5. In the preceding problem, let $N = 8$, $m = 4$, and suppose there are initially 5 white balls in the urn. What is the probability that no white balls are left after 2 drawings (of 4 balls each)?

6. A particle moves randomly along the interval $[1, m]$, coming to rest only at the points with coordinates $x = 1, \ldots, m$. The particle's motion is described by a Markov chain such that

$$p_{12} = 1, \qquad p_{m,m-1} = 1,$$
$$p_{j,j+1} = p, \qquad p_{j,j-1} = q \qquad (j = 2, \ldots, m - 1),$$

with all other transition probabilities equal to zero. Which states are persistent and which transient?

7. In the preceding problem, show that the limiting probabilities defined in Theorem 7.4 do not exist. In particular, show that the condition (7.20) does not hold for any $N$.

Hint. $p_{11}(n) = 0$ if $n$ is odd, while $p_{12}(n) = 0$ if $n$ is even.
8. Consider the same kind of random walk as in Problem 6, but now suppose the nonzero transition probabilities are

$$p_{11} = q, \qquad p_{mm} = p,$$
$$p_{j,j+1} = p, \qquad p_{j,j-1} = q \qquad (j = 1, \ldots, m),$$

permitting the particle to stay at the points $x = 1$ and $x = m$. Which states are persistent and which transient? Show that the limiting probabilities $p_1^*, \ldots, p_m^*$ defined in Theorem 7.4 now exist.


9. In the preceding problem, calculate the limiting probabilities $p_1^*, \ldots, p_m^*$.

Ans. Solving the system of equations

$$p_1^* = q p_1^* + q p_2^*,$$
$$p_j^* = p\, p_{j-1}^* + q\, p_{j+1}^* \qquad (j = 2, \ldots, m - 1),$$
$$p_m^* = p\, p_{m-1}^* + p\, p_m^*,$$

we get

$$p_j^* = \left(\frac{p}{q}\right)^{j-1} p_1^* \qquad (j = 1, \ldots, m).$$

Therefore

$$p_j^* = \frac{1}{m}$$

if $p = q$, while

$$p_j^* = \frac{1 - (p/q)}{1 - (p/q)^m} \left(\frac{p}{q}\right)^{j-1}$$

if $p \ne q$ (impose the condition that $\sum_{j=1}^{m} p_j^* = 1$).


10. Two marksmen A and B take turns shooting at a target. It is agreed that A 
will shoot after each hit, while B will shoot after each miss. Suppose A hits the 
target with probability $\alpha > 0$, while B hits the target with probability $\beta > 0$, 
and let $n$ be the number of shots fired. What is the limiting probability of hitting 
the target as $n \to \infty$? 

Ans. $\dfrac{\beta}{1 - \alpha + \beta}$. 


11. Suppose the condition (7.20) holds for a transition probability matrix 
whose column sums (as well as row sums) all equal unity. Find the limiting 
probabilities $p_1^*, \ldots, p_m^*$. 


Ans. 


12. Suppose $m$ white balls and $m$ black balls are mixed together and divided 
equally between two urns. A ball is then drawn at random from each urn and 
put into the other urn. Suppose this is done $n$ times. If the number of white 
balls in a given urn is $j$, we say that the “system” is in the state $\varepsilon_j$ (the number 
of white balls in the other urn is then $m - j$). Prove that the limiting 
probabilities $p_0^*, p_1^*, \ldots, p_m^*$ defined in Theorem 7.4 exist, and calculate them. 

Hint. The only nonzero transition probabilities are 

$$p_{j,j-1} = \frac{j^2}{m^2}, \qquad p_{jj} = \frac{2j(m - j)}{m^2}, \qquad p_{j,j+1} = \frac{(m - j)^2}{m^2}.$$

Ans. Solving the system 

$$p_j^* = p_{j-1}^* p_{j-1,j} + p_j^* p_{jj} + p_{j+1}^* p_{j+1,j} \qquad (j = 0, 1, \ldots, m),$$

we get $p_j^* = \binom{m}{j}^2 p_0^*$, and hence 

$$p_j^* = \frac{\binom{m}{j}^2}{\sum_{j=0}^m \binom{m}{j}^2} = \frac{\binom{m}{j}^2}{\binom{2m}{m}}$$

(recall Problem 17, p. 12). 
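
As a sanity check on the answer to Problem 12, the sketch below (an illustration added here, with hypothetical helper names `urn_step` and `urn_limit`) builds the transition probabilities from the hint and verifies with exact fractions that the squared binomial coefficients, divided by $\binom{2m}{m}$, form an invariant distribution. 

```python
from fractions import Fraction
from math import comb


def urn_step(dist, m):
    """One exchange of balls; dist[j] is the probability of j white balls in the given urn."""
    new = [Fraction(0)] * (m + 1)
    for j, mass in enumerate(dist):
        down = Fraction(j * j, m * m)            # p_{j,j-1} = j^2 / m^2
        up = Fraction((m - j) ** 2, m * m)       # p_{j,j+1} = (m - j)^2 / m^2
        stay = 1 - down - up                     # p_{jj} = 2j(m - j) / m^2
        if j > 0:
            new[j - 1] += down * mass
        new[j] += stay * mass
        if j < m:
            new[j + 1] += up * mass
    return new


def urn_limit(m):
    """Candidate limiting probabilities: squared binomial coefficients over C(2m, m)."""
    total = comb(2 * m, m)
    return [Fraction(comb(m, j) ** 2, total) for j in range(m + 1)]


limit = urn_limit(4)                              # m = 4 is an arbitrary illustrative size
assert sum(limit) == 1
assert urn_step(limit, 4) == limit
```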
13. Find the stationary distribution $p_1^0, p_2^0, \ldots$ for the Markov chain whose only 
nonzero transition probabilities are 

$$p_{j1} = \frac{j}{j + 1}, \qquad p_{j,j+1} = \frac{1}{j + 1} \qquad (j = 1, 2, \ldots).$$

Ans. $$p_j^0 = \frac{1}{(e - 1)\,j!} \qquad (j = 1, 2, \ldots).$$
14. Two gamblers A and B repeatedly play a game such that A’s probability 
of winning is $p$, while B’s probability of winning is $q = 1 - p$. Each bet is a 
dollar, and the total capital of both players is $m$ dollars. Find the probability 
of each player being ruined, given that A’s initial capital is $j$ dollars. 

Hint. Let $\varepsilon_j$ denote the state in which A has $j$ dollars. Then the situation 
is described by a Markov chain whose only nonzero transition probabilities are 

$$p_{00} = 1, \qquad p_{mm} = 1,$$
$$p_{j,j+1} = p, \qquad p_{j,j-1} = q \qquad (j = 1, \ldots, m - 1).$$

Ans. Let $\bar p_j = \lim_{n \to \infty} p_{j0}(n)$ be the probability of A’s ruin, starting with an 
initial capital of $j$ dollars. Then 

$$\bar p_1 = p\,\bar p_2 + q, \qquad \bar p_{m-1} = q\,\bar p_{m-2},$$
$$\bar p_j = q\,\bar p_{j-1} + p\,\bar p_{j+1} \qquad (j = 2, \ldots, m - 2)$$

(why?). Solving this system of equations, we get 

$$\bar p_j = 1 - \frac{j}{m} \tag{7.33}$$

if $p = q$ (as in Example 3, p. 29), and 

$$\bar p_j = \frac{1 - (p/q)^{m-j}}{1 - (p/q)^m} \tag{7.34}$$

if $p \ne q$. The probability of B’s ruin is $1 - \bar p_j$. 
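
The formulas (7.33) and (7.34) can be verified against the difference equations they are supposed to solve. The sketch below is an added illustration: the function names `ruin_formula` and `check_ruin_equations` and the sample values are assumptions made here, and the boundary values $\bar p_0 = 1$, $\bar p_m = 0$ are used to write all equations in the single form $\bar p_j = q\,\bar p_{j-1} + p\,\bar p_{j+1}$. 

```python
from fractions import Fraction


def ruin_formula(j, m, p):
    """Probability of A's ruin from capital j, following (7.33) and (7.34)."""
    q = 1 - p
    if p == q:
        return 1 - Fraction(j, m)
    r = p / q
    return (1 - r ** (m - j)) / (1 - r ** m)


def check_ruin_equations(m, p):
    """True if x_j = q*x_{j-1} + p*x_{j+1} holds for j = 1..m-1, with x_0 = 1 and x_m = 0."""
    q = 1 - p
    x = [ruin_formula(j, m, p) for j in range(m + 1)]
    boundary_ok = (x[0] == 1 and x[m] == 0)
    interior_ok = all(x[j] == q * x[j - 1] + p * x[j + 1] for j in range(1, m))
    return boundary_ok and interior_ok


assert check_ruin_equations(6, Fraction(2, 5))    # unfair game, p != q
assert check_ruin_equations(6, Fraction(1, 2))    # fair game, p = q
```

With exact fractions the check is an identity rather than an approximation, so any sign or index slip in the closed form would be detected. 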


15. In the preceding problem, prove that if $p > q$, then A’s probability of ruin 
increases if the stakes are doubled. 

16. Prove that a gambler playing against an adversary with unlimited capital 
is certain to be ruined unless his probability of winning in each play of the game 
exceeds $\frac{1}{2}$. 

Hint. Let $m \to \infty$ in (7.33) and (7.34). 

CONTINUOUS MARKOV PROCESSES 


18. Definitions. The Sojourn Time 


Consider a physical system with the following properties, which are 
the exact analogues of those given on p. 83 for a Markov chain, except 
that now the time $t$ varies continuously: 

a) The system can occupy any of a finite or countably infinite number of 
states $\varepsilon_1, \varepsilon_2, \ldots$ 

b) Starting from some initial state at time $t = 0$, the system changes its 
state randomly at subsequent times. Thus, the evolution of the system 
in time is described by the “random function” $\xi(t)$, equal to the state 
of the system at time $t$.¹ 

c) At time $t = 0$, the system occupies the state $\varepsilon_i$ with initial probability 

$$p_i^0 = P\{\xi(0) = \varepsilon_i\}, \qquad i = 1, 2, \ldots$$

d) Suppose the system is in the state $\varepsilon_i$ at any time $s$. Then the probability 
that the system goes into the state $\varepsilon_j$ after a time $t$ is given by 

$$p_{ij}(t) = P\{\xi(s + t) = \varepsilon_j \mid \xi(s) = \varepsilon_i\}, \qquad i, j = 1, 2, \ldots, \tag{8.1}$$

regardless of its behavior before the time $s$. The numbers $p_{ij}(t)$, called 
the transition probabilities, do not depend on the time $s$. 

A random process described by this model is called a continuous Markov 
process² or simply a Markov process (as opposed to a Markov chain, which 
might be called a “discrete Markov process”). 

¹ Recall footnote 1, p. 83. Note that $\xi(t)$ is a random variable for any fixed $t$. 
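Let 

$$p_j(t) = P\{\xi(t) = \varepsilon_j\}, \qquad j = 1, 2, \ldots$$

be the probability that the system will be in the state $\varepsilon_j$ at time $t$. Then, by 
arguments which hardly differ from those given on p. 84, we have 

$$p_j(0) = p_j^0, \qquad j = 1, 2, \ldots,$$

$$p_j(s + t) = \sum_k p_k(s)\,p_{kj}(t), \qquad j = 1, 2, \ldots \tag{8.2}$$

and 

$$p_{ij}(0) = \begin{cases} 1 & \text{if } j = i, \\ 0 & \text{if } j \ne i, \end{cases} \tag{8.3}$$

$$p_{ij}(s + t) = \sum_k p_{ik}(s)\,p_{kj}(t), \qquad i, j = 1, 2, \ldots \tag{8.4}$$

for arbitrary $s$ and $t$ [cf. (7.5) and (7.7)]. 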


THEOREM 8.1. Given a Markov process in the state $\varepsilon$ at time $t = t_0$, 
let $\tau$ be the (random) time it takes the process to leave $\varepsilon$ by going to 
some other state.³ Then 

$$P\{\tau > t\} = e^{-\lambda t}, \qquad t \ge 0, \tag{8.5}$$

where $\lambda$ is a nonnegative constant. 

Proof. Clearly $P\{\tau > t\}$ is some function of $t$, say 

$$\varphi(t) = P\{\tau > t\}, \qquad t \ge 0.$$

If $\tau > s$, then the process will be in the same state at time $t_0 + s$ as at 
time $t_0$, and hence its subsequent behavior will be the same as if $s = 0$. 
In particular, 

$$P\{\tau > s + t \mid \tau > s\} = \varphi(t)$$

is the probability of the event $\{\tau > s + t\}$ given that $\tau > s$. It follows 
that 

$$P\{\tau > s + t\} = P\{\tau > s + t \mid \tau > s\}\,P\{\tau > s\} = \varphi(t)\varphi(s),$$

and hence 

$$\varphi(s + t) = \varphi(s)\varphi(t)$$

or equivalently 

$$\ln \varphi(s + t) = \ln \varphi(s) + \ln \varphi(t)$$

for arbitrary $s$ and $t$. Therefore $\ln \varphi(t)$ is proportional to $t$ (recall 
footnote 4, p. 40), say 

$$\ln \varphi(t) = -\lambda t, \qquad t \ge 0, \tag{8.6}$$

where $\lambda$ is some nonnegative constant (why nonnegative?). But (8.6) 
implies (8.5). ∎ 

² More exactly, a continuous Markov process with stationary transition probabilities, 
where we allude to the fact that the numbers (8.1) do not depend on $s$ (cf. footnote 2, 
p. 84). 

³ Here we prefer to talk about states of the process rather than states of the system (as 
in Chap. 7). 


The parameter $\lambda$ figuring in (8.5) is called the density of the transition 
out of the state $\varepsilon$. If $\lambda = 0$, the process remains forever in $\varepsilon$. If $\lambda > 0$, the 
probability of the process undergoing a change of state in a small time 
interval $\Delta t$ is clearly 

$$1 - \varphi(\Delta t) = \lambda\,\Delta t + o(\Delta t), \tag{8.7}$$

where $o(\Delta t)$ denotes an infinitesimal of higher order than $\Delta t$. 
It follows from (8.5) that 

$$P\{t_1 < \tau < t_2\} = \varphi(t_1) - \varphi(t_2) = e^{-\lambda t_1} - e^{-\lambda t_2} = \int_{t_1}^{t_2} \lambda e^{-\lambda t}\,dt \tag{8.8}$$

for arbitrary nonnegative $t_1$ and $t_2$ ($t_1 < t_2$). Therefore the random variable 
$\tau$, called the sojourn time in state $\varepsilon$, has the probability density 

$$p_\tau(t) = \begin{cases} \lambda e^{-\lambda t} & \text{if } t \ge 0, \\ 0 & \text{if } t < 0. \end{cases} \tag{8.9}$$

The distribution corresponding to (8.8) and (8.9) is called the exponential 
distribution, with parameter $\lambda$. The mean value $E\tau$, i.e., the “expected 
sojourn time in state $\varepsilon$,” is given by 

$$E\tau = \int_0^\infty t\,p_\tau(t)\,dt = \frac{1}{\lambda}.$$
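
The two properties of the sojourn time used above, the functional equation $\varphi(s + t) = \varphi(s)\varphi(t)$ and the mean value $1/\lambda$, are easy to illustrate numerically. The following sketch is an added illustration under stated assumptions: the value $\lambda = 2$, the sample size, and the helper names `survival` and `sample_sojourn` are arbitrary choices made here, and the sampling uses the standard inverse-transform method rather than anything described in the text. 

```python
import math
import random


def survival(t, lam):
    """phi(t) = P{tau > t} = exp(-lam * t), formula (8.5)."""
    return math.exp(-lam * t)


def sample_sojourn(lam, rng):
    """Draw one sojourn time by inverse transform: -ln(U)/lam with U uniform on (0, 1]."""
    return -math.log(1.0 - rng.random()) / lam


lam = 2.0                                   # illustrative transition density
rng = random.Random(0)                      # fixed seed so the run is reproducible
samples = [sample_sojourn(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)

# Sample mean should be close to the expected sojourn time 1/lam.
assert abs(mean - 1 / lam) < 0.02
# "No memory": P{tau > s + t | tau > s} equals P{tau > t}.
assert math.isclose(survival(3.0, lam) / survival(1.0, lam), survival(2.0, lam))
```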


Example (Radioactive decay). In Example 3, p. 58, we gave a probabilistic 
model of the radioactive decay of radium (Ra) into radon (Rn). The 
behavior of each of the $n_0$ radium atoms is described by a Markov process 
with two states (Ra and Rn) and one possible transition (Ra → Rn). As on 
p. 58, let $p(t)$ be the probability that a radium atom decays into a radon 
atom in time $t$, and $\xi(t)$ the number of alpha particles emitted in $t$ seconds. 
Then, according to formula (5.7), 

$$P\{\xi(t) = k\} = \frac{a^k}{k!}e^{-a}, \qquad k = 0, 1, 2, \ldots,$$

where 

$$a = E\xi(t) = n_0 p(t).$$

It follows from (8.5) that 

$$p(t) = 1 - e^{-\lambda t}, \qquad t \ge 0,$$

where $\lambda$ is the density of the transition Ra → Rn. Recalling (8.7), we see that 
$\lambda$ is the constant such that the probability of the transition Ra → Rn in a 
small time interval $\Delta t$ equals $\lambda\,\Delta t + o(\Delta t)$. 

The number of (undisintegrated) radium atoms left after time $t$ is clearly 
$n_0 - \xi(t)$, with mean value 

$$n(t) = E[n_0 - \xi(t)] = n_0 - n_0 p(t) = n_0 e^{-\lambda t}, \qquad t \ge 0. \tag{8.10}$$

Let $T$ be the half-life of radium, i.e., the amount of time required for half the 
radium to disappear (on the average). Then 

$$n(T) = \tfrac{1}{2} n_0 \tag{8.11}$$

and hence, comparing (8.10) and (8.11), we find that $T$ is related to the 
density $\lambda$ of the transition Ra → Rn by the formula 

$$T = \frac{\ln 2}{\lambda}.$$
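
The relation between half-life and transition density can be spelled out in a few lines. This is only a sketch: the function names and the numerical values of $n_0$ and $\lambda$ below are hypothetical and are not physical data for radium. 

```python
import math


def half_life(lam):
    """T = ln 2 / lam, obtained by comparing (8.10) with (8.11)."""
    return math.log(2) / lam


def mean_remaining(n0, lam, t):
    """n(t) = n0 * exp(-lam * t), formula (8.10)."""
    return n0 * math.exp(-lam * t)


n0, lam = 1000.0, 0.3                        # made-up values, for illustration only
T = half_life(lam)
assert math.isclose(mean_remaining(n0, lam, T), n0 / 2)        # half left after T
assert math.isclose(mean_remaining(n0, lam, 2 * T), n0 / 4)    # a quarter left after 2T
```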


19. The Kolmogorov Equations 


Next we find differential equations satisfied by the transition probabilities 
of a Markov process: 


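THEOREM 8.2. Given a Markov process with a finite number of states, 
suppose the transition probabilities $p_{ij}(t)$ are such that⁴ 

$$1 - p_{ii}(\Delta t) = \lambda_i\,\Delta t + o(\Delta t), \qquad i = 1, 2, \ldots,$$
$$p_{ij}(\Delta t) = \lambda_{ij}\,\Delta t + o(\Delta t), \qquad j \ne i, \quad i, j = 1, 2, \ldots, \tag{8.12}$$

and let 

$$\lambda_{ii} = -\lambda_i, \qquad i = 1, 2, \ldots \tag{8.13}$$

Then the transition probabilities satisfy two systems of linear differential 
equations, the forward Kolmogorov equations⁵ 

$$p_{ij}'(t) = \sum_k p_{ik}(t)\lambda_{kj}, \qquad i, j = 1, 2, \ldots \tag{8.14}$$

and the backward Kolmogorov equations 

$$p_{ij}'(t) = \sum_k \lambda_{ik}\,p_{kj}(t), \qquad i, j = 1, 2, \ldots, \tag{8.15}$$

subject to the initial conditions (8.3). 

Proof. It follows from (8.4) that 

$$p_{ij}(t + \Delta t) = \sum_k p_{ik}(t)\,p_{kj}(\Delta t) = \sum_k p_{ik}(\Delta t)\,p_{kj}(t).$$

Hence, using (8.12) and (8.13), we have 

$$\frac{p_{ij}(t + \Delta t) - p_{ij}(t)}{\Delta t} = \sum_k p_{ik}(t)\left[\lambda_{kj} + \frac{o(\Delta t)}{\Delta t}\right] = \sum_k \left[\lambda_{ik} + \frac{o(\Delta t)}{\Delta t}\right]p_{kj}(t).$$

Both sums have definite limits as $\Delta t \to 0$. In fact, 

$$\lim_{\Delta t \to 0} \sum_k p_{ik}(t)\left[\lambda_{kj} + \frac{o(\Delta t)}{\Delta t}\right] = \sum_k p_{ik}(t)\lambda_{kj}, \tag{8.16}$$

$$\lim_{\Delta t \to 0} \sum_k \left[\lambda_{ik} + \frac{o(\Delta t)}{\Delta t}\right]p_{kj}(t) = \sum_k \lambda_{ik}\,p_{kj}(t). \tag{8.17}$$

Therefore 

$$\lim_{\Delta t \to 0} \frac{p_{ij}(t + \Delta t) - p_{ij}(t)}{\Delta t} = p_{ij}'(t)$$

also exists, and equals (8.16) and (8.17). ∎ 

⁴ We might call $\lambda_i$ the “density of the transition out of the state $\varepsilon_i$,” and $\lambda_{ij}$ the “density 
of the transition from the state $\varepsilon_i$ to the state $\varepsilon_j$.” 

⁵ The prime denotes differentiation with respect to $t$. 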
Remark 1. It follows from (8.12) and the condition 

$$\sum_j p_{ij}(\Delta t) = 1$$

that 

$$\lambda_i = \sum_{j \ne i} \lambda_{ij}. \tag{8.18}$$

Remark 2. The Kolmogorov equations hold not only in the case of a 
finite number of states, but also in the case of a countably infinite number of 
states $\varepsilon_1, \varepsilon_2, \ldots$ if we make certain additional assumptions. In fact, suppose 
the error terms $o(\Delta t)$ in (8.12) are such that 

$$\frac{o(\Delta t)}{\Delta t} \to 0 \quad \text{as } \Delta t \to 0$$

uniformly in all $i$ and $j$. Then the forward equations (8.14) hold if for any 
fixed $j$, there is a constant $C < \infty$ such that 

$$\lambda_{ij} < C, \qquad i = 1, 2, \ldots,$$

while the backward equations (8.15) hold if the series (8.18) converges. 


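Example 1 (The Poisson process). As in Example 4, p. 73, consider 
a “random flow of events” with density $\lambda$, and let $\xi(t)$ be the number of events 
which occur in time $t$. Then $\xi(t)$ is called a Poisson process. Clearly $\xi(t)$ is a 
Markov process, whose states can be described by the integers 0, 1, 2, ... 
Moreover, $\xi(t)$ can only leave the state $i$ by going into the state $i + 1$. 
Therefore the transition densities $\lambda_{ij}$ are just 

$$\lambda_{ij} = \begin{cases} \lambda & \text{if } j = i + 1, \\ 0 & \text{if } j \ne i,\ i + 1, \end{cases} \qquad \lambda_{ii} = -\lambda,$$

where we use (8.13) and (8.18). 

The transition probabilities $p_{ij}(t)$ of the Poisson process $\xi(t)$ clearly 
satisfy the condition 

$$p_{ij}(t) = p_{0,j-i}(t)$$

(why?). Let 

$$p_j(t) = p_{0j}(t), \qquad j = 0, 1, 2, \ldots$$

Then the forward Kolmogorov equations take the form 

$$p_0'(t) = -\lambda p_0(t),$$
$$p_j'(t) = \lambda p_{j-1}(t) - \lambda p_j(t), \qquad j = 1, 2, \ldots$$

Introducing the new functions 

$$f_j(t) = e^{\lambda t}p_j(t), \qquad j = 0, 1, 2, \ldots,$$

we find that 

$$f_0'(t) = \lambda f_0(t) + e^{\lambda t}p_0'(t) = \lambda f_0(t) - \lambda e^{\lambda t}p_0(t) = 0,$$
$$f_j'(t) = \lambda f_j(t) + e^{\lambda t}p_j'(t) = \lambda f_j(t) + \lambda e^{\lambda t}p_{j-1}(t) - \lambda e^{\lambda t}p_j(t) = \lambda f_{j-1}(t), \qquad j = 1, 2, \ldots,$$

where 

$$f_0(0) = 1, \qquad f_j(0) = 0, \quad j = 1, 2, \ldots, \tag{8.19}$$

because of (8.3). But the solution of the system of differential equations 

$$f_0'(t) = 0,$$
$$f_j'(t) = \lambda f_{j-1}(t), \qquad j = 1, 2, \ldots,$$

subject to the initial conditions (8.19), is obviously 

$$f_0(t) = 1, \quad f_1(t) = \lambda t, \quad \ldots, \quad f_j(t) = \frac{(\lambda t)^j}{j!}, \quad \ldots$$

Returning to the original functions $p_j(t) = e^{-\lambda t}f_j(t)$, we find that 

$$p_j(t) = \frac{(\lambda t)^j}{j!}e^{-\lambda t}, \qquad j = 0, 1, 2, \ldots$$

or equivalently 

$$P\{\xi(t) = j\} = \frac{(\lambda t)^j}{j!}e^{-\lambda t}, \qquad j = 0, 1, 2, \ldots,$$

just as on p. 75. 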
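
The forward equations of the Poisson process can also be integrated numerically and compared with the closed-form solution. The sketch below is an added illustration, not the method of the text: it uses the classical fourth-order Runge-Kutta scheme, truncates the state space to the first eight states (harmless here, since the equation for $p_j$ involves only $p_{j-1}$ and $p_j$), and the names `poisson_rhs`, `integrate`, `poisson_pmf` and all numerical values are assumptions of this example. 

```python
import math


def poisson_rhs(p, lam):
    """Right-hand sides: p_0' = -lam*p_0 and p_j' = lam*p_{j-1} - lam*p_j for j >= 1."""
    return [-lam * p[0]] + [lam * p[j - 1] - lam * p[j] for j in range(1, len(p))]


def integrate(lam, t, n_states=8, steps=2000):
    """Runge-Kutta integration of the forward equations from p_0(0) = 1, p_j(0) = 0."""
    p = [1.0] + [0.0] * (n_states - 1)
    h = t / steps
    for _ in range(steps):
        k1 = poisson_rhs(p, lam)
        k2 = poisson_rhs([x + 0.5 * h * k for x, k in zip(p, k1)], lam)
        k3 = poisson_rhs([x + 0.5 * h * k for x, k in zip(p, k2)], lam)
        k4 = poisson_rhs([x + h * k for x, k in zip(p, k3)], lam)
        p = [x + h * (a + 2 * b + 2 * c + d) / 6
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def poisson_pmf(j, lam, t):
    """Closed-form solution (lam*t)**j / j! * exp(-lam*t)."""
    return (lam * t) ** j / math.factorial(j) * math.exp(-lam * t)


numeric = integrate(lam=1.5, t=2.0)
assert all(abs(numeric[j] - poisson_pmf(j, 1.5, 2.0)) < 1e-8 for j in range(8))
```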
Example 2 (A service system with exponential holding times). Consider 
a random flow of service calls arriving at a server, where the incoming 
“traffic” is of the Poisson type described in Example 1, with density $\lambda$. Thus 
$\lambda\,\Delta t + o(\Delta t)$ is the probability that at least one call arrives in a small time 
interval $\Delta t$. Suppose it takes a random time $\tau$ to service each incoming call, 
where $\tau$ has an exponential distribution with parameter $\mu$: 

$$P\{\tau > t\} = e^{-\mu t} \tag{8.20}$$

(the case of “exponential holding times”). Then the service system has two 
states, a state $\varepsilon_0$ if the server is “free” and a state $\varepsilon_1$ if the server is “busy.” 
It will be assumed that a call is rejected (and is no longer a candidate for 
service) if it arrives when the server is busy. 

Suppose the system is in the state $\varepsilon_0$ at time $t_0$. Then its subsequent 
behavior does not depend on its previous history, since the calls arrive 
independently. The probability $p_{01}(\Delta t)$ of the system going from the state 
$\varepsilon_0$ to the state $\varepsilon_1$ during a small time interval $\Delta t$ is just the probability 
$\lambda\,\Delta t + o(\Delta t)$ of at least one call arriving during $\Delta t$. Hence the density of the 
transition from $\varepsilon_0$ to $\varepsilon_1$ equals $\lambda$. On the other hand, suppose the system is 
in the state $\varepsilon_1$ at time $t_1$. Then the probability $p_{10}(t)$ of the system going from 
the state $\varepsilon_1$ to the state $\varepsilon_0$ after a time $t$ is just the probability that service 
will fail to last another $t$ seconds.⁶ Suppose that at the time $t_1$, service has 
already been in progress for exactly $s$ seconds. Then 

$$p_{10}(t) = 1 - P\{\tau > s + t \mid \tau > s\} = 1 - \frac{P\{\tau > s + t\}}{P\{\tau > s\}}.$$

Using (8.20), we find that 

$$p_{10}(t) = 1 - \frac{e^{-\mu(s + t)}}{e^{-\mu s}} = 1 - e^{-\mu t}, \tag{8.21}$$

regardless of the time $s$, i.e., regardless of the system’s behavior before the 
time $t_1$.⁷ Hence the system can be described by a Markov process, with two 
states $\varepsilon_0$ and $\varepsilon_1$. 

The transition probabilities of this Markov process obviously satisfy the 
conditions 

$$p_{01}(t) = 1 - p_{00}(t), \qquad p_{10}(t) = 1 - p_{11}(t). \tag{8.22}$$

Moreover, 

$$\lambda_{00} = -\lambda, \qquad \lambda_{01} = \lambda,$$
$$\lambda_{10} = \mu, \qquad \lambda_{11} = -\mu,$$

where we use the fact that 

$$p_{10}(\Delta t) = 1 - e^{-\mu\,\Delta t} = \mu\,\Delta t + o(\Delta t).$$

Hence in this case the forward Kolmogorov equations (8.14) become⁸ 

$$p_{00}'(t) = \lambda_{00}\,p_{00}(t) + \lambda_{10}\,p_{01}(t) = -\lambda p_{00}(t) + \mu[1 - p_{00}(t)],$$
$$p_{11}'(t) = \lambda_{01}\,p_{10}(t) + \lambda_{11}\,p_{11}(t) = \lambda[1 - p_{11}(t)] - \mu p_{11}(t),$$

i.e., 

$$p_{00}'(t) + (\lambda + \mu)p_{00}(t) = \mu,$$
$$p_{11}'(t) + (\lambda + \mu)p_{11}(t) = \lambda. \tag{8.23}$$

Solving (8.23) subject to the initial conditions 

$$p_{00}(0) = p_{11}(0) = 1,$$

we get 

$$p_{00}(t) = \left(1 - \frac{\mu}{\lambda + \mu}\right)e^{-(\lambda + \mu)t} + \frac{\mu}{\lambda + \mu},$$
$$p_{11}(t) = \left(1 - \frac{\lambda}{\lambda + \mu}\right)e^{-(\lambda + \mu)t} + \frac{\lambda}{\lambda + \mu}. \tag{8.24}$$

⁶ For simplicity, we choose seconds as the time units. 

⁷ It is important to note that this is true only for exponential holding times (see W. 
Feller, op. cit., p. 458). 
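
One way to gain confidence in the solution (8.24) is to substitute it back into (8.23) numerically. The sketch below is an added illustration with hypothetical names and arbitrary values $\lambda = 2$, $\mu = 3$: it approximates the derivatives by central differences, checks that the residuals of (8.23) are negligible, and confirms the initial conditions and the limiting values $\mu/(\lambda + \mu)$ and $\lambda/(\lambda + \mu)$. 

```python
import math


def p00(t, lam, mu):
    """First formula of (8.24)."""
    return (1 - mu / (lam + mu)) * math.exp(-(lam + mu) * t) + mu / (lam + mu)


def p11(t, lam, mu):
    """Second formula of (8.24)."""
    return (1 - lam / (lam + mu)) * math.exp(-(lam + mu) * t) + lam / (lam + mu)


def residuals(t, lam, mu, h=1e-6):
    """Left side minus right side of both equations (8.23), using central differences."""
    d00 = (p00(t + h, lam, mu) - p00(t - h, lam, mu)) / (2 * h)
    d11 = (p11(t + h, lam, mu) - p11(t - h, lam, mu)) / (2 * h)
    return (d00 + (lam + mu) * p00(t, lam, mu) - mu,
            d11 + (lam + mu) * p11(t, lam, mu) - lam)


lam, mu = 2.0, 3.0                                   # illustrative densities
assert math.isclose(p00(0.0, lam, mu), 1.0) and math.isclose(p11(0.0, lam, mu), 1.0)
assert all(abs(r) < 1e-6 for r in residuals(0.4, lam, mu))
assert math.isclose(p00(50.0, lam, mu), mu / (lam + mu))   # exponential term has died out
assert math.isclose(p11(50.0, lam, mu), lam / (lam + mu))
```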


20. More on Limiting Probabilities. Erlang’s Formula 


We now prove the continuous analogue of Theorem 7.4: 


THEOREM 8.3. Let $\xi(t)$ be a Markov process with a finite number of 
states $\varepsilon_1, \ldots, \varepsilon_m$, each accessible from every other state. Then 

$$\lim_{t \to \infty} p_j(t) = p_j^*,$$

where $p_j(t)$ is the probability of $\xi(t)$ being in the state $\varepsilon_j$ at time $t$. The 
numbers $p_j^*$, $j = 1, \ldots, m$, called the limiting probabilities, do not 
depend on the initial probability distribution and satisfy the inequalities 

$$\max_i |p_{ij}(t) - p_j^*| \le Ce^{-Dt}, \qquad |p_j(t) - p_j^*| \le Ce^{-Dt} \tag{8.25}$$

for suitable positive constants $C$ and $D$. 

Proof. The proof is virtually the same as that of Theorem 7.4 for 
Markov chains, once we verify that the continuous analogue of the 
condition (7.20), p. 93 is automatically satisfied. In fact, we now have 

$$\min_{i,j} p_{ij}(t) = \delta(t) > 0 \tag{8.26}$$

for all $t > 0$. To show this, we first observe that $p_{ii}(t)$ is positive for 
sufficiently small $t$, being a continuous function (why?) satisfying the 
condition $p_{ii}(0) = 1$. But, because of (8.4), 

$$p_{ii}(s + t) \ge p_{ii}(s)\,p_{ii}(t)$$

for arbitrary $s$ and $t$, and hence $p_{ii}(t)$ is positive for all $t$. 

To show that $p_{ij}(t)$, $i \ne j$ is also positive for all $t$, thereby proving 
(8.26) and the theorem, we note that 

$$p_{ij}(s) > 0$$

for some $s$, since $\varepsilon_j$ is accessible from $\varepsilon_i$. But 

$$p_{ij}(t) \ge p_{ij}(u)\,p_{jj}(t - u), \qquad u \le t,$$

again by (8.4), where, as just shown, $p_{jj}(t - u)$ is always positive. 
Hence it suffices to show that $p_{ij}(u) > 0$ for some $u \le t$. Consider a 
Markov chain with the same states $\varepsilon_1, \ldots, \varepsilon_m$ and transition 
probabilities 

$$p_{ij} = p_{ij}\!\left(\frac{s}{n}\right),$$

where $n$ is an integer such that 

$$n \ge m\,\frac{s}{t}.$$

Since 

$$p_{ij}(n) = p_{ij}(s) > 0,$$

the state $\varepsilon_j$ is accessible from $\varepsilon_i$. But it is easy to see that $\varepsilon_j$ is accessible 
from $\varepsilon_i$ not only in $n$ steps, but also in a number of steps $n_0$ no greater 
than the total number of states $m$ (think this through). Therefore 

$$p_{ij}\!\left(n_0\,\frac{s}{n}\right) > 0,$$

where 

$$n_0\,\frac{s}{n} = u \le t.$$

⁸ Because of (8.22), there is no need to write equations for $p_{01}(t)$ and $p_{10}(t)$. 


The limiting probabilities $p_j^*$, $j = 1, \ldots, m$ form a stationary distribution 
in the same sense as on p. 96. More exactly, if we choose the initial 
distribution 

$$p_j^0 = p_j^*, \qquad j = 1, \ldots, m,$$

then 

$$p_j(t) \equiv p_j^*, \qquad j = 1, \ldots, m,$$

i.e., the probability of the system being in the state $\varepsilon_j$ remains unchanged 
for all $t > 0$. In fact, taking the limit as $s \to \infty$ in (8.2), we get 

$$p_j^* = \sum_i p_i^*\,p_{ij}(t), \qquad j = 1, \ldots, m. \tag{8.27}$$

But the right-hand side is just $p_j(t)$, as we see by choosing $s = 0$ in (8.2). 
Suppose the transition probabilities satisfy the conditions (8.12). Then 
differentiating (8.27) and setting $t = 0$, we find that 

$$\sum_i p_i^*\lambda_{ij} = 0, \qquad j = 1, \ldots, m, \tag{8.28}$$

where $\lambda_{ij}$ is the density of the transition from the state $\varepsilon_i$ to the state $\varepsilon_j$. 


Example (A service system with m servers). Consider a service system 
which can handle up to $m$ incoming calls at once, i.e., suppose there are $m$ 
servers and an incoming call can be handled if at least one server is free. As 
in Example 2, p. 108, we assume that the incoming traffic is of the Poisson 
type with density $\lambda$, and that the time it takes each server to service a call 
is exponentially distributed with parameter $\mu$ (this is again a case of 
“exponential holding times”). Moreover, it will be assumed that a call is rejected 
(and is no longer a candidate for service) if it arrives when all $m$ servers are 
busy, and that the “holding times” of the $m$ servers are independent random 
variables. 

If precisely $j$ servers are busy, we say that the service system is in the 
state $\varepsilon_j$ ($j = 0, 1, \ldots, m$). In particular, $\varepsilon_0$ means that the whole system is 
free and $\varepsilon_m$ that the system is completely busy. For almost the same reasons 
as on p. 108, the evolution of the system in time from state to state is described 
by a Markov process. The only nonzero transition densities of this 
process are 

$$\lambda_{00} = -\lambda, \qquad \lambda_{01} = \lambda, \qquad \lambda_{mm} = -m\mu,$$
$$\lambda_{j,j+1} = \lambda, \qquad \lambda_{jj} = -(\lambda + j\mu), \qquad \lambda_{j,j-1} = j\mu \qquad (j = 1, \ldots, m - 1). \tag{8.29}$$

In fact, suppose the system is in the state $\varepsilon_j$. Then a transition from $\varepsilon_j$ to 
$\varepsilon_{j+1}$ takes place if a single call arrives, which happens in a small time interval 
$\Delta t$ with probability $\lambda\,\Delta t + o(\Delta t)$.⁹ Moreover, the probability that none of the 
$j$ busy servers becomes free in time $\Delta t$ is just 

$$[1 - \mu\,\Delta t + o(\Delta t)]^j,$$

since the holding times are independent, and hence the probability of at 
least one server becoming free in time $\Delta t$ equals 

$$1 - [1 - \mu\,\Delta t + o(\Delta t)]^j = j\mu\,\Delta t + o(\Delta t).$$

But for small $\Delta t$, this is also the probability of a single server becoming free 
in time $\Delta t$, i.e., of a transition from $\varepsilon_j$ to $\varepsilon_{j-1}$. The transitions to new states 
other than $\varepsilon_{j-1}$ or $\varepsilon_{j+1}$ have small probabilities of order $o(\Delta t)$. These 
considerations, together with (8.12) and the formula 

$$\sum_j \lambda_{ij} = 0$$

implied by (8.12) and (8.13), lead at once to (8.29). 

⁹ For small $\Delta t$, this is also the probability of at least one call arriving in $\Delta t$. 

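In the case $m = 1$, it is clear from the formulas (8.24) that the transition 
probabilities $p_{ij}(t)$ approach their limiting values “exponentially fast” as 
$t \to \infty$. It follows from the general formula (8.25) that the same is true in the 
case $m > 1$ (more than 1 server). To find these limiting probabilities $p_j^*$, 
we use (8.28) and (8.29), obtaining the following system of linear equations: 

$$\lambda p_0^* = \mu p_1^*,$$
$$(\lambda + j\mu)p_j^* = \lambda p_{j-1}^* + (j + 1)\mu p_{j+1}^* \qquad (j = 1, \ldots, m - 1),$$
$$\lambda p_{m-1}^* = m\mu p_m^*.$$

Solving this system, we get 

$$p_j^* = \frac{1}{j!}\left(\frac{\lambda}{\mu}\right)^j p_0^*, \qquad j = 0, 1, \ldots, m.$$

Using the “normalization condition” 

$$\sum_{j=0}^m p_j^* = 1$$

to determine $p_0^*$, we finally obtain Erlang’s formula 

$$p_j^* = \frac{\dfrac{1}{j!}\left(\dfrac{\lambda}{\mu}\right)^j}{\displaystyle\sum_{k=0}^m \frac{1}{k!}\left(\frac{\lambda}{\mu}\right)^k}, \qquad j = 0, 1, \ldots, m \tag{8.30}$$

for the limiting probabilities. 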
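
Erlang’s formula is simple to evaluate, and the result can be checked against the linear system it was derived from. The following sketch is an added illustration: the function names `erlang` and `satisfies_balance`, the use of exact fractions, and the sample values $\lambda = 2$, $\mu = 1$, $m = 3$ are assumptions of this example. 

```python
from fractions import Fraction
from math import factorial


def erlang(m, lam, mu):
    """Limiting probabilities p_0*, ..., p_m* from Erlang's formula (8.30)."""
    load = Fraction(lam) / Fraction(mu)
    weights = [load ** j / factorial(j) for j in range(m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def satisfies_balance(dist, lam, mu):
    """Check the three groups of linear equations preceding (8.30)."""
    m = len(dist) - 1
    first = lam * dist[0] == mu * dist[1]
    middle = all((lam + j * mu) * dist[j]
                 == lam * dist[j - 1] + (j + 1) * mu * dist[j + 1]
                 for j in range(1, m))
    last = lam * dist[m - 1] == m * mu * dist[m]
    return first and middle and last


dist = erlang(3, 2, 1)                     # illustrative: lam = 2, mu = 1, three servers
assert dist == [Fraction(3, 19), Fraction(6, 19), Fraction(6, 19), Fraction(4, 19)]
assert satisfies_balance(dist, 2, 1)
# For m = 1 the formula gives mu/(lam+mu) and lam/(lam+mu), the limits of (8.24).
assert erlang(1, 2, 3) == [Fraction(3, 5), Fraction(2, 5)]
```

The last entry of the returned list, $p_m^*$, is the limiting probability that all servers are busy, i.e., that an arriving call is rejected. 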


PROBLEMS 


1. Suppose each alpha particle emitted by a sample of radium has probability 
$p$ of being recorded by a Geiger counter. What is the probability of exactly $n$ 
particles being recorded in $t$ seconds? 

Ans. $\dfrac{(\lambda p t)^n}{n!}e^{-\lambda p t}$, where $\lambda$ is the same as in the example on p. 104. 

2. A man has two telephones on his desk, one receiving calls with density $\lambda_1$, 
the other with density $\lambda_2$.¹⁰ What is the probability of exactly $n$ calls being 
received in $t$ seconds? 

Hint. Recall Problem 9, p. 81. Neglect the effect of the lines being found 
busy. 

Ans. $\dfrac{[(\lambda_1 + \lambda_2)t]^n}{n!}e^{-(\lambda_1 + \lambda_2)t}$. 

3. Given a Poisson process with density $\lambda$, let $\xi(t)$ be the number of events 
occurring in time $t$. Find the correlation coefficient of the random variables 
$\xi(t)$ and $\xi(t + \tau)$, where $\tau > 0$. 

Ans. $\sqrt{\dfrac{t}{t + \tau}}$. 

4. Show that (8.24) leads to Erlang’s formula (8.30) for $m = 1$. 

5. The arrival of customers at the complaint desk of a department store is 
described by a Poisson process with density $\lambda$. Suppose each clerk takes a 
random time $\tau$ to handle a complaint, where $\tau$ has an exponential distribution 
with parameter $\mu$, and suppose a customer leaves whenever he finds all the clerks 
busy. How many clerks are needed to make the probability of customers 
leaving unserved less than 0.015 if $\lambda = \mu$? 

Hint. Use Erlang’s formula (8.30). 
Ans. Four. 


6. A single repairman services $m$ automatic machines, which normally do not 
require his attention. Each machine has probability $\lambda\,\Delta t + o(\Delta t)$ of breaking 
down in a small time interval $\Delta t$. The time required to repair each machine is 
exponentially distributed with parameter $\mu$. Find the limiting probability of 
exactly $j$ machines being out of order. 

Hint. Solve the system of equations 

$$m\lambda p_0^* = \mu p_1^*,$$
$$[(m - j)\lambda + \mu]p_j^* = (m - j + 1)\lambda p_{j-1}^* + \mu p_{j+1}^*,$$
$$\lambda p_{m-1}^* = \mu p_m^*.$$

Ans. $$p_j^* = \frac{m!}{(m - j)!}\left(\frac{\lambda}{\mu}\right)^j p_0^*, \qquad j = 0, 1, \ldots, m,$$

where $p_0^*$ is determined from the condition $\sum_{j=0}^m p_j^* = 1$. 

Comment. Note the similarity between this result and formula (8.30). 

7. In the preceding problem, find the average number of machines awaiting 
the repairman’s attention. 

Ans. $m - \dfrac{\lambda + \mu}{\lambda}(1 - p_0^*)$. 

¹⁰ It is assumed that the incoming calls on each line form a Poisson process. 

8. Solve Problem 6 for the case of $r$ repairmen, where $1 < r < m$. 


9. An electric power line serves $m$ identical machines, each operating 
independently of the others. Suppose that in a small interval of time $\Delta t$ each machine 
has probability $\lambda\,\Delta t + o(\Delta t)$ of being turned on and probability $\mu\,\Delta t + o(\Delta t)$ 
of being turned off. Find the limiting probability $p_j^*$ of exactly $j$ machines 
being on. 

Hint. Solve the system of equations 

$$m\lambda p_0^* = \mu p_1^*,$$
$$[(m - j)\lambda + j\mu]p_j^* = (m - j + 1)\lambda p_{j-1}^* + (j + 1)\mu p_{j+1}^*,$$
$$m\mu p_m^* = \lambda p_{m-1}^*.$$

Ans. $$p_j^* = \binom{m}{j}\left(\frac{\lambda}{\lambda + \mu}\right)^j\left(\frac{\mu}{\lambda + \mu}\right)^{m-j}, \qquad j = 0, 1, \ldots, m.$$

10. Show that the answer to the preceding problem is just what one would 
expect by an elementary argument if $\lambda = \mu$. 


Appendix 1 


INFORMATION THEORY 


Given a random experiment with $N$ equiprobable outcomes $A_1, \ldots, A_N$, 
how much “information” is conveyed on the average by a message $\mathcal{M}$ 
telling us which of the outcomes $A_1, \ldots, A_N$ has actually occurred? As a 
reasonable measure of this information, we might take the average length of 
the message $\mathcal{M}$, provided $\mathcal{M}$ is written in an “economical way.” For 
example, suppose we use a “binary code,” representing each of the possible 
outcomes $A_1, \ldots, A_N$ by a “code word” of length $l$, i.e., by a sequence 

$$a_1 \ldots a_l,$$

where each “digit” $a_k$ is either a 0 or a 1. Obviously there are $2^l$ such words 
(all of the same length $l$), and hence to be capable of uniquely designating the 
$N$ possible outcomes, we must choose a value of $l$ such that 

$$N \le 2^l. \tag{1}$$

The smallest value of $l$ satisfying (1) is just the integer such that 

$$0 \le l - \log_2 N < 1.$$

This being the case, the quantity 

$$I = \log_2 N \tag{2}$$

is clearly a reasonable definition of the average amount of information in the 
message $\mathcal{M}$ (measured in binary units or “bits”). 
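
The relation between the code word length $l$ and $\log_2 N$ can be made concrete with a short computation. This is an illustrative sketch only; the function name `code_length` is introduced here and does not appear in the text. 

```python
import math


def code_length(n_outcomes):
    """Smallest l such that n_outcomes <= 2**l, i.e. the smallest l satisfying (1)."""
    l = 0
    while 2 ** l < n_outcomes:
        l += 1
    return l


assert code_length(8) == 3      # 8 outcomes fit exactly into 3 binary digits
assert code_length(9) == 4      # one more outcome forces a fourth digit
# The smallest admissible l never exceeds log2(N) by as much as one digit.
assert all(0 <= code_length(n) - math.log2(n) < 1 for n in range(1, 200))
```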
More generally, suppose the outcomes $A_1, \ldots, A_N$ have different 
probabilities 

$$p_1 = P(A_1), \quad \ldots, \quad p_N = P(A_N). \tag{3}$$

Then it is clear that being told about a rare outcome conveys more information 
than being told about a likely outcome.¹ To take this into account, we 
repeat the experiment $n$ times, where $n$ is very large, and send a new message 
$\mathcal{M}'$ conveying the result of the whole series of $n$ trials. Each outcome is now 
a sequence 

$$A_{i_1} \ldots A_{i_n}, \tag{4}$$

where $A_{i_k}$ is the outcome occurring at the $k$th trial. Of the $N^n$ possible 
outcomes of the whole series of trials, it is overwhelmingly likely that the outcome 
will belong to a much smaller set containing only 

$$N_1 = \frac{n!}{n_1! \cdots n_N!} \tag{5}$$

outcomes, where 

$$n_1 = np_1, \quad \ldots, \quad n_N = np_N, \qquad n_1 + \cdots + n_N = n.$$

In fact, let $n_i = n(A_i)$ be the number of occurrences of the event $A_i$ in $n$ 
trials. Then 

$$\frac{n_i}{n} \sim p_i$$

by the law of large numbers (see Sec. 12), and hence $n_i \sim np_i$. To get (5), 
we merely replace $\sim$ by $=$ and invoke Theorem 1.4, p. 7. We emphasize 
that this is a plausibility argument and not a rigorous proof,² but the basic 
idea is perfectly sound. 

Continuing in this vein, we argue that only a negligibly small amount of 
information is lost on the average if we neglect all but the set of $N_1$ highly 
likely outcomes of the form (4), all with the same probability 

$$P(A_{i_1}) \cdots P(A_{i_n}) = p_1^{n_1} \cdots p_N^{n_N}.$$

This brings us back to the case of equiprobable outcomes, and suggests 
defining the average amount of information conveyed by the message $\mathcal{M}'$ as 

$$I' = \log_2 N_1.$$

Hence, dividing by the number of trials, we find that the average amount of 
information in the original message $\mathcal{M}$ is just 

$$I = \frac{\log_2 N_1}{n}. \tag{6}$$

¹ In particular, no information at all is conveyed by being told that the sure event has 
occurred, because we already know what the message will be! 

² Among other missing details, we note that the numbers $n_1, \ldots, n_N$ are in general not 
all integers, as assumed in (5). 

To calculate (6), we apply Stirling’s formula (see p. 10) to the expression 
(5), obtaining 

$$N_1 \sim \frac{\sqrt{2\pi n}\,n^n e^{-n}}{\sqrt{2\pi n_1}\,n_1^{n_1}e^{-n_1} \cdots \sqrt{2\pi n_N}\,n_N^{n_N}e^{-n_N}}$$

and hence 

$$\ln N_1 \sim n \ln n - np_1 \ln (np_1) - \cdots - np_N \ln (np_N)$$
$$= n \ln n - (np_1 + \cdots + np_N)\ln n - np_1 \ln p_1 - \cdots - np_N \ln p_N$$
$$= -n \sum_{i=1}^N p_i \ln p_i$$

in terms of the natural logarithm 

$$\ln x = \log_e x,$$

or equivalently 

$$\log_2 N_1 \sim -n \sum_{i=1}^N p_i \log_2 p_i \tag{7}$$

in terms of the logarithm to the base 2. Changing $\sim$ to $=$ and substituting 
(7) into (6), we get Shannon’s formula 

$$I = -\sum_{i=1}^N p_i \log_2 p_i \tag{8}$$

for the average amount of information in a message $\mathcal{M}$ telling which of the 
$N$ outcomes $A_1, \ldots, A_N$ with probabilities (3) has occurred. Note that (8) 
reduces to (2) if the outcomes are equiprobable, since then 

$$p_1 = \cdots = p_N = \frac{1}{N}.$$
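
The plausibility argument leading to Shannon’s formula can be illustrated numerically: for large $n$, the quantity $(\log_2 N_1)/n$ in (6) should be close to the value given by (8). The sketch below is an added illustration. The function names, the distribution $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ and the choice $n = 10^6$ are assumptions made here, the logarithm of the factorials is computed with the log-gamma function to avoid huge integers, and terms with $p_i = 0$ are skipped (the usual convention that $0 \log 0 = 0$, which the text does not discuss). 

```python
import math


def shannon_information(probs):
    """Formula (8): I = -sum p_i log2 p_i, skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def log2_multinomial(counts):
    """log2 of n!/(n_1! ... n_N!), the number N_1 in (5), computed via lgamma."""
    n = sum(counts)
    ln_value = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_value / math.log(2)


probs = [0.5, 0.25, 0.25]                       # illustrative distribution
n = 10 ** 6
counts = [round(n * p) for p in probs]          # n_i = n * p_i (integers for this choice)

assert math.isclose(shannon_information(probs), 1.5)
assert abs(log2_multinomial(counts) / n - shannon_information(probs)) < 1e-3
# Equiprobable outcomes: (8) reduces to (2), I = log2 N.
assert math.isclose(shannon_information([0.25] * 4), 2.0)
```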

Example 1 (Average time of psychological reaction). One of $N$ lamps is 
illuminated at random, where $p_i$ is the probability of the $i$th lamp being turned 
on, and an observer is asked to point out the lamp which is lit. In a long 
series of independent trials it turns out³ that the average time required to give 
the correct answer is proportional to the quantity (8) rather than to the 
number of lamps $N$, as might have been expected. 

³ See A. M. Yaglom and I. M. Yaglom, Wahrscheinlichkeit und Information, second 
edition, VEB Deutscher Verlag der Wissenschaften, Berlin (1965), p. 67. 

We can interpret the quantity (8) not only as the average amount of 
information conveyed by the message $\mathcal{M}$, but also as the average amount of 
“uncertainty” residing in the given random experiment, and hence as a 
measure of the randomness of the experiment. Receiving the message 
reduces the uncertainty of the outcome of the experiment to zero, since the 
message tells us the result of the experiment with complete certainty. More 
generally, we might ask for the amount of information about one “full set” 
of mutually exclusive events $A_1, \ldots, A_N$ conveyed by being told which of 
a related full set of mutually exclusive events $B_1, \ldots, B_{N'}$ has occurred. 
Suppose the two sets of events have probabilities $P(A_1), \ldots, P(A_N)$ and 
$P(B_1), \ldots, P(B_{N'})$, where 

$$P(A_1) + \cdots + P(A_N) = 1, \qquad P(B_1) + \cdots + P(B_{N'}) = 1.$$

Moreover, let $P(A_iB_j)$ be the probability that both events $A_i$ and $B_j$ occur, 
while $P(A_i \mid B_j)$ is the probability of $A_i$ occurring if $B_j$ is known to have 
occurred. Then 

$$I_{A|B_j} = -\sum_{i=1}^N P(A_i \mid B_j)\log_2 P(A_i \mid B_j)$$

is the amount of uncertainty about the events $A_1, \ldots, A_N$ remaining after 
$B_j$ is known to have occurred, and hence 

$$I_{A|B} = \sum_{j=1}^{N'} P(B_j)I_{A|B_j} = -\sum_{j=1}^{N'}\sum_{i=1}^N P(B_j)P(A_i \mid B_j)\log_2 P(A_i \mid B_j) = -\sum_{i=1}^N\sum_{j=1}^{N'} P(A_iB_j)\log_2 \frac{P(A_iB_j)}{P(B_j)} \tag{9}$$

is the average amount of uncertainty about $A_1, \ldots, A_N$ remaining after it is 
known which of the events $B_1, \ldots, B_{N'}$ has occurred. Let $I_{AB}$ be the 
information about the events $A_1, \ldots, A_N$ conveyed by knowledge of which of 
the events $B_1, \ldots, B_{N'}$ has occurred. Then clearly⁴ 

$$I_{AB} = I_A - I_{A|B}, \tag{10}$$

where 

$$I_A = -\sum_{i=1}^N P(A_i)\log_2 P(A_i) = -\sum_{i=1}^N\sum_{j=1}^{N'} P(A_iB_j)\log_2 P(A_i) \tag{11}$$

is the quantity previously denoted by $I$ (justify the last step). Combining (9) 
and (11), we finally get 

$$I_{AB} = \sum_{i=1}^N\sum_{j=1}^{N'} P(A_iB_j)\log_2 \frac{P(A_iB_j)}{P(A_i)P(B_j)}. \tag{12}$$

Example 2 (Weather prediction). During a certain season it rains about 
once every five days, the weather being fair the rest of the time. Every night 
a prediction is made of the next day’s weather. Suppose a prediction of rain 
is wrong about half the time, while a prediction of fair weather is wrong 
only about one time out of ten. How much information about the weather 
is conveyed on the average by the predictions? 

⁴ In words, (10) says that “the information in the message” equals “the uncertainty 
before the message is received” minus “the uncertainty after the message is received.” 

Solution. Let $A_1$ denote rain, $A_2$ fair weather, $B_1$ a prediction of rain 
and $B_2$ a prediction of fair weather. Then, to a good approximation, 

$$P(A_1) = \frac{1}{5}, \qquad P(A_2) = \frac{4}{5},$$
$$P(A_1 \mid B_1) = \frac{1}{2}, \qquad P(A_1 \mid B_2) = \frac{1}{10}.$$

Moreover, since 

$$P(A_1) = P(A_1 \mid B_1)P(B_1) + P(A_1 \mid B_2)P(B_2),$$

we have 

$$\frac{1}{5} = \frac{1}{2}P(B_1) + \frac{1}{10}[1 - P(B_1)],$$

and hence 

$$P(B_1) = \frac{1}{4}, \qquad P(B_2) = \frac{3}{4},$$

$$P(A_1B_1) = P(A_1 \mid B_1)P(B_1) = \frac{1}{8},$$
$$P(A_1B_2) = P(A_1 \mid B_2)P(B_2) = \frac{3}{40},$$
$$P(A_2B_1) = [1 - P(A_1 \mid B_1)]P(B_1) = \frac{1}{8},$$
$$P(A_2B_2) = [1 - P(A_1 \mid B_2)]P(B_2) = \frac{27}{40}.$$

It follows from (12) that 

$$I_{AB} = \frac{1}{8}\log_2 \frac{5}{2} + \frac{3}{40}\log_2 \frac{1}{2} + \frac{1}{8}\log_2 \frac{5}{8} + \frac{27}{40}\log_2 \frac{9}{8} \approx 0.12$$

is the average amount of information about the weather conveyed by a 
prediction. In the case of 100% accurate predictions, $A_1 = B_1$, $A_2 = B_2$ and 
(12) reduces to 

$$I_{AB} = -\frac{1}{5}\log_2 \frac{1}{5} - \frac{4}{5}\log_2 \frac{4}{5} \approx 0.72.$$
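
The arithmetic of this example can be reproduced directly from formula (12). The sketch below is an added illustration: the function name `mutual_information` and the layout of the joint probabilities as a nested list are choices made here, with `joint[i][j]` standing for $P(A_{i+1}B_{j+1})$. Terms with zero joint probability are skipped, which is what makes the case of perfectly accurate predictions computable. 

```python
import math
from fractions import Fraction


def mutual_information(joint):
    """Formula (12) in bits; joint[i][j] is the probability that A_i and B_j both occur."""
    p_a = [sum(row) for row in joint]                 # marginal probabilities P(A_i)
    p_b = [sum(col) for col in zip(*joint)]           # marginal probabilities P(B_j)
    return sum(
        float(pij) * math.log2(pij / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, pij in enumerate(row)
        if pij > 0
    )


# Joint probabilities computed in the solution above.
weather = [[Fraction(1, 8), Fraction(3, 40)],
           [Fraction(1, 8), Fraction(27, 40)]]
# Perfectly accurate predictions: A_1 = B_1 and A_2 = B_2.
perfect = [[Fraction(1, 5), Fraction(0)],
           [Fraction(0), Fraction(4, 5)]]

assert round(mutual_information(weather), 2) == 0.12
assert round(mutual_information(perfect), 2) == 0.72
```

So the imperfect forecasts convey only about a sixth of the roughly 0.72 bit of uncertainty that a perfect forecast would remove. 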


PROBLEMS 


1. Which conveys more information, a message telling a stranger’s birthday or 
a message telling his telephone number? 

2. Find the average amount of information in bits of a message telling whether 
or not the outcome of throwing a pair of unbiased dice is 
a) An odd number; b) A prime number; c) A number no greater than 5. 


3. An experiment has four possible outcomes, with probabilities Ye Yr TF 
and ,',, respectively. What is the average amount of uncertainty about the 


outcome of the experiment? 


4. Each of the signals $A_1, \ldots, A_n$ has equal probability of being transmitted 
over a communication channel. In the absence of noise, the signal $A_j$ is received 
as $a_j$ ($j = 1, \ldots, n$), while in the presence of noise $A_j$ has probability $p$ of being 
received as $a_j$ and equal probability of being received as any of the other symbols. 
What is the average amount of information about the symbols $A_1, \ldots, A_n$ 
conveyed by receiving one of the signals $a_1, \ldots, a_n$ 

a) In the absence of noise; b) In the presence of noise? 

Ans. a) $\log_2 n$; b) $\log_2 n + p \log_2 p + (1 - p)\log_2 \dfrac{1 - p}{n - 1}$. 


Appendix 2 


GAME THEORY 


Consider the following simple model of a game played repeatedly by two
players.¹ Each player can choose one of two strategies determining the result
of the game. The interests of the players are completely conflicting, i.e.,
whatever one player wins, the other loses.² Such a “two-person game” can be
described by the table shown in Figure 9, where the quantity $v_{ij}$ in the $i$th row
and $j$th column is the amount gained by the first player if he chooses
strategy $i$ while his opponent chooses strategy $j$ ($i, j = 1, 2$). For example,
$v_{12}$ is the amount gained by the first player (the first player’s “payoff”) if
he chooses the first strategy and his opponent (the second player) chooses the
second strategy, while $-v_{21}$ is the second player’s payoff if he chooses
strategy 1 and his opponent (the first player) chooses strategy 2. It is
now natural to ask for each player’s “optimal strategy.” This question is
easily answered in the case where

$$\min(v_{11}, v_{12}) > \max(v_{21}, v_{22}), \tag{1}$$

say, since then regardless of how the second player acts, the first player
should always choose the first strategy, thereby guaranteeing himself a gain
of at least

$$\min(v_{11}, v_{12}).$$

$$\begin{array}{|c|c|} \hline v_{11} & v_{12} \\ \hline v_{21} & v_{22} \\ \hline \end{array}$$

Figure 9

1 More generally, a game of strategy involves more than two players, each with more
than two available strategies, but the essential features of game theory (in particular, its
connection with probability) emerge even in this extremely simple case.

2 Such a game, in which the algebraic sum of the players’ winnings is zero, is called a
zero-sum game.


Assuming a “clever” opponent, the second player should then choose the
strategy which minimizes the first player’s maximum gain, i.e., the strategy $j$
such that

$$v_{1j} = \min(v_{11}, v_{12}).$$


The case just described is atypical. Usually, a relation like (1) does not
hold, and each player should adopt a “mixed strategy,” sometimes choosing
one of the two “pure strategies” available to him and sometimes choosing the
other, with definite probabilities (found in a way to be discussed). More
exactly, the first player should choose the $i$th strategy with probability $p_{1i}$,
while the second player should (independently) choose the $j$th strategy with
probability $p_{2j}$. Then the first player’s strategy is described by a probability
distribution $P_1 = \{p_{11}, p_{12}\}$, while the second player’s strategy is described
by a probability distribution $P_2 = \{p_{21}, p_{22}\}$. If these mixed strategies are
adopted, the average gain to the first player is clearly just

$$V(P_1, P_2) = \sum_{i,j=1}^{2} v_{ij}\, p_{1i}\, p_{2j}. \tag{2}$$


Suppose the second player makes the optimal response to each strategy
$P_1 = \{p_{11}, p_{12}\}$ chosen by the first player, by adopting the strategy
$P_2^* = \{p_{21}^*, p_{22}^*\}$ minimizing the first player’s gain. The first player then wins an
amount

$$V(P_1, P_2^*) = \min_{P_2} V(P_1, P_2) = V_1(P_1)$$

if he chooses the strategy $P_1$. To maximize this gain, the first player should
choose the strategy $P_1^0 = \{p_{11}^0, p_{12}^0\}$ such that

$$V_1(P_1^0) = \max_{P_1} V_1(P_1),$$

always, of course, under the assumption that his opponent plays in the best
possible way. Exactly the same argument can be applied to the second player,
and shows that his optimal strategy, guaranteeing his maximum average gain
under the assumption of optimal play on the part of his opponent, is the
strategy $P_2^0 = \{p_{21}^0, p_{22}^0\}$ such that

$$V_2(P_2^0) = \max_{P_2} V_2(P_2),$$

where

$$V_2(P_2) = \min_{P_1} \{-V(P_1, P_2)\}.$$

To calculate the optimal strategies $P_1^0$ and $P_2^0$, we consider the function

$$V(x, y) = v_{11}xy + v_{12}x(1 - y) + v_{21}(1 - x)y + v_{22}(1 - x)(1 - y),$$

which for $x = p_{11}$ and $y = p_{21}$ equals the average gain of the first player if
the mixed strategies $P_1 = \{p_{11}, p_{12}\}$ and $P_2 = \{p_{21}, p_{22}\}$ are chosen. The
function $V(x, y)$ is linear in each of the variables $x$ and $y$, $0 \le x, y \le 1$.
Hence, for every fixed $x$, $V(x, y)$ achieves its minimum $V_1(x)$ at one of the
end points of the interval $0 \le y \le 1$, i.e., for $y = 0$ or $y = 1$:

$$V_1(x) = \min_{y} V(x, y) = \min \{v_{12}x + v_{22}(1 - x),\ v_{11}x + v_{21}(1 - x)\}.$$
¥ 


As shown in Figure 10, the graph of the function $V_1(x)$ is a broken line with
vertex at the point $x^0$ such that

$$v_{12}x^0 + v_{22}(1 - x^0) = v_{11}x^0 + v_{21}(1 - x^0),$$

i.e., at the point

$$x^0 = \frac{v_{22} - v_{21}}{v_{11} + v_{22} - (v_{12} + v_{21})}. \tag{3}$$

Figure 10. A case where $\min(v_{11}, v_{12}) < \max(v_{21}, v_{22})$.


The value $x = x^0$ for which the function $V_1(x)$, $0 \le x \le 1$ takes its
maximum is just the probability $p_{11}^0$ with which the first player should
choose his first pure strategy. The corresponding optimal mixed strategy
$P_1^0 = \{p_{11}^0, p_{12}^0\}$ guarantees the maximum average gain for the first player
under the assumption of optimal play on the part of his opponent. This
gain is

$$V_1(x^0) = v_{11}x^0 + v_{21}(1 - x^0) = v_{12}x^0 + v_{22}(1 - x^0). \tag{4}$$

Moreover, (4) implies

$$V(x^0, y) = y[v_{11}x^0 + v_{21}(1 - x^0)] + (1 - y)[v_{12}x^0 + v_{22}(1 - x^0)] = V_1(x^0)$$

for any $y$ in the interval $0 \le y \le 1$. Hence, by choosing $p_{11}^0 = x^0$, the first
player guarantees himself an average gain $V_1(x^0)$ regardless of the value of
$y$, i.e., regardless of how his opponent plays. However, if the first player
deviates from this optimal strategy, by choosing $p_{11} = x \ne x^0$, then his
opponent need only choose $p_{21} = y$ equal to 0 or 1 (as the case may be) to
reduce the first player’s average gain to just $V_1(x)$.

Applying the same considerations to the second player, we find that the
second player’s optimal strategy is such that $p_{21}^0 = y^0$, where

$$y^0 = \frac{v_{22} - v_{12}}{v_{11} + v_{22} - (v_{12} + v_{21})} \tag{5}$$

[(5) is obtained from (3) by reversing the roles of players 1 and 2, i.e., by
interchanging the two indices of each $v_{ij}$]. As in the case of the first player, this
choice guarantees the second player an average gain $V_2(y^0)$ regardless of the
first player’s strategy, i.e.,

$$-V(x, y^0) = V_2(y^0), \qquad 0 \le x \le 1. \tag{6}$$

In particular, (4) and (6) imply

$$V_1(x^0) = V(x^0, y^0),$$
$$V_2(y^0) = -V(x^0, y^0).$$


Example 1. One player repeatedly hides either a dime or a quarter, and 
the second player guesses which coin is hidden. If he guesses properly, he 
gets the coin, but otherwise he must pay the first player 15 cents. Find both 
players’ optimal strategies. 

Solution. Here³

$$v_{11} = -10, \quad v_{12} = 15, \quad v_{21} = 15, \quad v_{22} = -25,$$

so that, by (3),

$$p_{11}^0 = x^0 = \frac{-25 - 15}{-35 - 30} = \frac{8}{13}.$$

Therefore the first player should hide the dime with probability $\frac{8}{13}$, and hide
the quarter with probability $\frac{5}{13}$. Similarly, by (5),

$$p_{21}^0 = y^0 = \frac{-25 - 15}{-35 - 30} = \frac{8}{13},$$


3 For the first player, hiding the dime is (pure) strategy 1, and hiding the quarter is
strategy 2. For the second player, guessing the dime is strategy 1, and guessing the quarter is
strategy 2.

and hence the second player should guess that the hidden coin is a dime with
probability $\frac{8}{13}$, and that it is a quarter with probability $\frac{5}{13}$. Then, according
to (4), the first player’s average gain will be

$$V(x^0, y^0) = -10 \cdot \frac{8}{13} + 15 \cdot \frac{5}{13} = -\frac{5}{13},$$

while the second player’s average gain will be

$$-V(x^0, y^0) = \frac{5}{13}.$$

Thus this game is unfavorable to the first player, who loses an average of
$\frac{5}{13}$ of a cent every time he plays, even if he adopts the optimal strategy. However,
any departure from the optimal strategy will lead to an even greater loss, if
his opponent responds properly.
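The calculation in Example 1 can be checked with a short Python sketch. It assumes a payoff table with no pure optimal strategy, so that each player mixes; $x^0$ is the value of $x$ that makes $V(x, y)$ independent of $y$, and $y^0$ is the value of $y$ that makes $V(x, y)$ independent of $x$. The names `solve_2x2` and `average_gain` are illustrative and do not come from the text.

```python
from fractions import Fraction


def solve_2x2(v11, v12, v21, v22):
    """Mixed-strategy solution of a 2x2 zero-sum game (no saddle point assumed).

    Returns (x0, y0, value): x0 is the probability that player 1 uses his
    first pure strategy, y0 the same for player 2, and value is player 1's
    average gain when both play optimally.
    """
    v11, v12, v21, v22 = (Fraction(v) for v in (v11, v12, v21, v22))
    denom = v11 + v22 - (v12 + v21)
    if denom == 0:
        raise ValueError("degenerate payoff table")
    x0 = (v22 - v21) / denom  # makes V(x0, y) independent of y
    y0 = (v22 - v12) / denom  # makes V(x, y0) independent of x
    if not (0 <= x0 <= 1 and 0 <= y0 <= 1):
        raise ValueError("a pure strategy is optimal; the formulas do not apply")
    value = v11 * x0 + v21 * (1 - x0)
    return x0, y0, value


def average_gain(v11, v12, v21, v22, x, y):
    """Average gain V(x, y) of player 1 for mixing probabilities x and y."""
    return (v11 * x * y + v12 * x * (1 - y)
            + v21 * (1 - x) * y + v22 * (1 - x) * (1 - y))


# Payoffs of Example 1, in cents gained by the player who hides the coin.
payoffs = (-10, 15, 15, -25)
x0, y0, value = solve_2x2(*payoffs)
print(x0, y0, value)  # 8/13 8/13 -5/13
```

Exact rational arithmetic is used so that the equalizing property can be tested without rounding: `average_gain(*payoffs, x, y0)` returns the same value for every `x`.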


Example 2 (Aerial warfare).⁴ White repeatedly sends two-plane missions
to attack one of Blue’s installations. One plane carries bombs, and the
other (identical in appearance) flies cover for the plane carrying the bombs.
Suppose the lead plane can be defended better by the guns of the plane in the
second position than vice versa, so that the chance of the lead plane surviving
an attack by Blue’s fighter is 80%, while the chance of the plane in the second
position surviving such an attack is only 60%. Suppose further that Blue
can attack just one of White’s planes and that Blue’s sole concern is the
protection of his installation, while White’s sole concern is the destruction
of Blue’s installation. Which of White’s planes should carry the bombs, and
which plane should Blue attack?


Solution. Let White’s payoff be the probability of accomplishing the
mission. Then⁵

$$v_{11} = 0.8, \quad v_{12} = 1, \quad v_{21} = 1, \quad v_{22} = 0.6,$$

and hence

$$p_{11}^0 = x^0 = \frac{-0.4}{-0.6} = \frac{2}{3}, \qquad p_{21}^0 = y^0 = \frac{-0.4}{-0.6} = \frac{2}{3}$$

by (3) and (5). Thus always putting the bombs in the lead plane is not
White’s best strategy, although this plane is less likely to be shot down than
the other. In fact, if White always puts the bombs in the lead plane, then
Blue will always attack this plane and the resulting probability of the mission
succeeding will be 0.8. On the other hand, if White adopts the optimal mixed
strategy and puts the bombs in the lead plane only two times out of three,
he will increase his probability of accomplishing the mission by $\frac{1}{15}$, since,
according to (4),

$$V(x^0, y^0) = \frac{2}{3} \cdot \frac{8}{10} + \frac{1}{3} = \frac{13}{15}.$$

By the same token, Blue’s best strategy is to attack the lead plane two
times out of three and the other plane the rest of the time.

4 After J. D. Williams, The Compleat Strategyst, McGraw-Hill Book Co., Inc., New
York (1954), p. 47.

5 For White, putting the bombs in the lead plane is (pure) strategy 1, and putting the
bombs in the other plane is strategy 2. For Blue, attacking the lead plane is strategy 1,
and attacking the other plane is strategy 2.
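The gain from mixing in Example 2 can also be seen by tabulating the guaranteed payoff $V_1(x) = \min_y V(x, y)$ directly. The sketch below is only an illustration: the grid of 31 values of $x$ and the name `guaranteed_gain` are assumptions made for this example.

```python
from fractions import Fraction

# White's success probabilities (rows: bombs in lead / second plane,
# columns: Blue attacks lead / second plane).
V11, V12, V21, V22 = Fraction(4, 5), Fraction(1), Fraction(1), Fraction(3, 5)


def guaranteed_gain(x):
    """V_1(x): White's success probability against Blue's best reply."""
    attack_lead = V11 * x + V21 * (1 - x)   # Blue always attacks the lead plane
    attack_other = V12 * x + V22 * (1 - x)  # Blue always attacks the second plane
    return min(attack_lead, attack_other)


# x = probability that White puts the bombs in the lead plane.
grid = [Fraction(k, 30) for k in range(31)]
best_x = max(grid, key=guaranteed_gain)
print(best_x, guaranteed_gain(best_x), guaranteed_gain(Fraction(1)))  # 2/3 13/15 4/5
```

On this grid the maximum of $V_1(x)$ is attained at $x = 2/3$, where it exceeds the value 0.8 of the pure strategy $x = 1$ by $1/15$.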


PROBLEMS 


1. Prove that the game considered in Example 1 becomes favorable to the first
player if the second player’s penalty for incorrect guessing is raised to 20 cents.


2. In Example 1, let a be the second player’s penalty for incorrect guessing. For 
what value of a does the game become “fair” ? 


3. Blue has two installations, only one of which he can successfully defend, 
while White can attack either but not both of Blue's installations. Find the 
optimal strategies for White and Blue if one of the installations is three times as 
valuable as the other.⁶


Ans. White should attack the less valuable installation 3 out of 4 times,
while Blue should defend the more valuable installation 3 out of 4 times.


6 After J. D. Williams, op. cit., p. 51.


Appendix 3 


BRANCHING PROCESSES 


Consider a group of particles, each “randomly producing” more particles
of the same type by the following process:

a) The probability that each of the particles originally present at some
time $t = 0$ produces a group of $k$ particles after a time $t$ is given by
$p_k(t)$, where $k = 0, 1, 2, \ldots$ and $p_k(t)$ is the same for all the particles.¹

b) The behavior of each particle is independent of the behavior of the
other particles and of the events prior to the initial time $t = 0$.

A random process described by this model is called a branching process.
As concrete examples of such processes, think of nuclear chain reactions,
survival of family names, etc.²

Let $\xi(t)$ be the total number of particles present at time $t$. Then $\xi(t)$ is a
Markov process (why?). Suppose there are exactly $k$ particles initially
present at time $t = 0$, and let $\xi_i(t)$ be the number of particles produced by
the $i$th particle after a time $t$. Then clearly

$$\xi(t) = \xi_1(t) + \cdots + \xi_k(t), \tag{1}$$

where the random variables $\xi_1(t), \ldots, \xi_k(t)$ are independent and have the
same probability distribution

$$P\{\xi_i(t) = n\} = p_n(t), \qquad n = 0, 1, 2, \ldots$$

Let $p_{kn}(t)$ be the probability of the $k$ particles giving rise to a total of $n$
particles after time $t$, so that the numbers $p_{kn}(t)$ are the transition
probabilities of the Markov process $\xi(t)$, and introduce the generating functions³

$$F(t, z) = \sum_{n=0}^{\infty} p_n(t) z^n, \tag{2}$$

$$F_k(t, z) = \sum_{n=0}^{\infty} p_{kn}(t) z^n. \tag{3}$$

1 The case $k = 0$ corresponds to “annihilation” of a particle.
2 Concerning these examples and others, see W. Feller, op. cit., p. 294.


Suppose the probability of a single particle giving rise to a total of $n \ne 1$ particles
in a small time interval $\Delta t$ is

$$p_n(\Delta t) = \lambda_n \Delta t + o(\Delta t),$$

while the probability of the particle remaining unchanged is

$$p_1(\Delta t) = 1 - \lambda \Delta t + o(\Delta t).$$

Moreover, let

$$\lambda_1 = -\lambda,$$

so that

$$\sum_k \lambda_k = 0. \tag{4}$$

Then the Kolmogorov equations (8.15), p. 105 for the transition probabilities
$p_n(t) = p_{1n}(t)$ become

$$\frac{d}{dt} p_n(t) = \sum_k \lambda_k p_{kn}(t), \qquad n = 0, 1, 2, \ldots$$


Next we deduce a corresponding differential equation for the generating
function $F(t, z)$. Clearly

$$\frac{d}{dt} F(t, z) = \frac{d}{dt} \sum_n p_n(t) z^n = \sum_n \left[\frac{d}{dt} p_n(t)\right] z^n = \sum_k \lambda_k \sum_n p_{kn}(t) z^n = \sum_k \lambda_k F_k(t, z) \tag{5}$$

(justify the term-by-term differentiation), where $F_k(t, z)$ is the generating
function of the random variable $\xi(t)$ for the case of $k$ original particles.⁴
But, according to (1), $\xi(t)$ is the sum of $k$ independent random variables, each
with generating function $F(t, z)$. Therefore, by formula (6.7), p. 71,

$$F_k(t, z) = [F(t, z)]^k, \qquad k = 0, 1, 2, \ldots \tag{6}$$

(the formula is trivial for $k = 0$). Substituting (6) into (5), we get

$$\frac{d}{dt} F(t, z) = \sum_k \lambda_k [F(t, z)]^k. \tag{7}$$

3 Note that $F_1(t, z) = F(t, z)$, since clearly $p_{1n}(t) = p_n(t)$.
4 Clearly $F_0(t, z) = 1$, since new particles cannot be created in the absence of any original
particles.

In what follows, we will assume that a given branching process $\xi(t)$ is
specified by giving the transition densities $\lambda_k$, $k = 0, 1, 2, \ldots$ Let $f(x)$ be
the function defined by the power series

$$f(x) = \sum_{k=0}^{\infty} \lambda_k x^k, \tag{8}$$

so that in particular $f(x)$ is analytic for $0 \le x < 1$. Then, according to (7),
the generating function $F(t, z)$ satisfies a differential equation of the form

$$\frac{dx}{dt} = f(x). \tag{9}$$


Moreover, since $F(0, z) = z$, the generating function $F(t, z)$ coincides for
every $z$ in the interval $0 \le z < 1$ with the solution $x = x(t)$ of (9) satisfying
the initial condition

$$x(0) = z. \tag{10}$$

Instead of (9), it is often convenient to consider the equivalent differential
equation

$$\frac{dt}{dx} = \frac{1}{f(x)} \tag{11}$$

for the inverse $t = t(x)$ of the function $x = x(t)$. The function satisfying
(11) and the initial condition (10) is just

$$t = \int_z^x \frac{du}{f(u)}, \qquad 0 \le x < 1.$$


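Example 1. If

$$\lambda_0 = \lambda, \quad \lambda_1 = -\lambda, \quad \lambda_k = 0, \quad k = 2, 3, \ldots,$$

then

$$f(x) = \lambda(1 - x)$$

and

$$t = \int_z^x \frac{du}{f(u)} = -\frac{1}{\lambda} \ln \frac{1 - x}{1 - z}.$$

Hence $F = F(t, z)$ is such that

$$\ln(1 - F) = -\lambda t + \ln(1 - z),$$

i.e.,

$$F(t, z) = 1 - e^{-\lambda t}(1 - z).$$

The probabilities $p_n(t)$ are found from the expansion

$$F(t, z) = \sum_{n=0}^{\infty} p_n(t) z^n,$$

which in this case implies

$$p_0(t) = 1 - e^{-\lambda t}, \qquad p_1(t) = e^{-\lambda t}, \qquad p_n(t) = 0, \quad n = 2, 3, \ldots$$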
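The closed form found in Example 1 can be compared with a direct numerical solution of $dx/dt = f(x)$, $x(0) = z$. The following Python sketch uses a classical fourth-order Runge-Kutta step; the value $\lambda = 2$, the step count, and the names `rk4_solve` and `closed_form` are assumptions made only for this illustration.

```python
import math


def rk4_solve(f, x0, t_end, steps=2000):
    """Integrate dx/dt = f(x), x(0) = x0, up to t_end with classical RK4."""
    h = t_end / steps
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + h * k1 / 2)
        k3 = f(x + h * k2 / 2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x


LAM = 2.0  # assumed value of the transition density lambda


def f(x):
    """Right-hand side f(x) = lambda * (1 - x) of Example 1."""
    return LAM * (1.0 - x)


def closed_form(t, z):
    """Generating function F(t, z) = 1 - exp(-lambda t) (1 - z)."""
    return 1.0 - math.exp(-LAM * t) * (1.0 - z)


t, z = 1.5, 0.3
numeric = rk4_solve(f, z, t)
exact = closed_form(t, z)
print(numeric, exact)
```

Taking $z = 0$ in the same way gives the probability $p_0(t) = F(t, 0)$ that the single original particle has disappeared by time $t$.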


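Example 2. If

$$\lambda_0 = 0, \quad \lambda_1 = -1, \quad \lambda_k = \frac{1}{(k - 1)k}, \quad k = 2, 3, \ldots,$$

then

$$f(x) = \sum_{k=0}^{\infty} \lambda_k x^k = -x + \sum_{k=2}^{\infty} \frac{x^k}{(k - 1)k} = -x \ln(1 - x) + \ln(1 - x) = (1 - x)\ln(1 - x),$$

and hence

$$t = \int_z^x \frac{du}{f(u)} = \int_z^x \frac{du}{(1 - u)\ln(1 - u)} = -\int_z^x \frac{d \ln(1 - u)}{\ln(1 - u)} = -\ln|\ln(1 - x)| + \ln|\ln(1 - z)|.$$

It follows that $F = F(t, z)$ is such that

$$\frac{\ln(1 - F)}{\ln(1 - z)} = e^{-t},$$

i.e.,

$$F(t, z) = 1 - (1 - z)^{e^{-t}}.$$

To find the corresponding probabilities $p_n(t)$, we use repeated differentiation:

$$p_0(t) = 0, \qquad p_1(t) = e^{-t},$$

$$p_n(t) = \frac{1}{n!} \frac{\partial^n F(t, 0)}{\partial z^n} = \frac{(-1)^{n-1}}{n!}\, e^{-t}(e^{-t} - 1) \cdots (e^{-t} - n + 1), \qquad n = 2, 3, \ldots$$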


Turning to the analysis of the differential equation (9), where $f(x)$ is
given by (8), we note that

$$f''(x) = \sum_{k=2}^{\infty} k(k - 1)\lambda_k x^{k-2} > 0 \quad \text{if} \quad 0 < x < 1.$$

Therefore $f(x)$ is concave upward in the interval $0 < x < 1$, with a monotonically
increasing derivative. Because of (4), $x = 1$ is a root of the equation
$f(x) = 0$. This equation can have at most one other root $x = \alpha$ ($0 < \alpha < 1$).
Thus $f(x)$ must behave in one of the two ways shown in Figure 11.

We now study the more complicated case, where $f(x) = 0$ has two roots
$x = \alpha$ ($0 < \alpha < 1$) and $x = 1$, corresponding to two singular integral
curves $x(t) \equiv \alpha$ and $x(t) \equiv 1$ of the differential equations (9) and (11).

[Figure 11: graphs of $y = f(x)$ for the two cases.]

Consider the integral curve

$$t = \int_z^x \frac{du}{f(u)} \tag{12}$$

going through the point $t = 0$, $x = z$ ($0 \le z < \alpha$). Since the derivative
$f'(\alpha)$ is finite and $f(x) \sim f'(\alpha)(x - \alpha)$ for $x \sim \alpha$, the value of $t$ along the
integral curve (12) increases without limit as $x \to \alpha$, but the curve itself never
intersects the other integral curve $x(t) \equiv \alpha$. The function $f(x)$ is positive in the
interval $0 \le x < \alpha$, and hence the integral curve $x = x(t)$ increases
monotonically as $t \to \infty$, remaining bounded by the value $x = \alpha$. Being a bounded
monotonic function, $x(t)$ has a limit

$$\beta = \lim_{t \to \infty} x(t), \qquad z < \beta \le \alpha.$$


But $f(x)$ approaches a limit $f(\beta)$ as $x \to \beta$, i.e.,

$$f(\beta) = \lim_{t \to \infty} f[x(t)] = \lim_{t \to \infty} x'(t),$$

where $f(\beta)$ must vanish, since otherwise the function

$$x(t) = z + \int_0^t f[x(s)]\,ds$$

would increase without limit as $t \to \infty$. It follows that $\beta$ is a root of the
equation $f(x) = 0$, and hence must coincide with $\alpha$. Therefore all the integral
curves $x = x(t)$ going through the point $x = z$, $0 \le z < \alpha$ for $t = 0$
increase monotonically as $t \to \infty$ and satisfy the condition

$$\lim_{t \to \infty} x(t) = \alpha. \tag{13}$$

The behavior of the integral curves going through the point $x = z$, $\alpha < z < 1$
for $t = 0$ is entirely analogous. The only difference is that $x(t)$ now
decreases monotonically, since the derivative $x'(t) = f[x(t)]$ is negative and
$f(x) < 0$ for $\alpha < x < 1$. The behavior of typical integral curves in the interval
$0 \le z < 1$ is shown in Figure 12, where $0 < z_1 < \alpha < z_2 < 1$.

[Figure 12: typical integral curves starting from $z_1$ and $z_2$.]

The behavior of the integral curves at $z = 1$ warrants special discussion.
First we note that in any case $x(t) \equiv 1$ is an integral curve corresponding to
$z = 1$. Suppose

$$\int_{x_0}^1 \frac{dx}{f(x)} = -\infty \tag{14}$$

for some $x_0$, $\alpha < x_0 < 1$.⁵ Then an arbitrary integral curve of the form

$$t = t_0 + \int_{x_0}^x \frac{du}{f(u)}, \qquad 0 \le x < 1, \tag{15}$$

going through some point $(t_0, x_0)$, decreases without limit as $x \to 1$, i.e.,

$$t = t_0 + \int_{x_0}^x \frac{du}{f(u)} \to -\infty$$

as $x \to 1$. This shows that given any $t_0 > 0$, the equation

$$t(z) = t_0 + \int_{x_0}^z \frac{du}{f(u)} = 0$$

holds for some $x = z$, $\alpha < z < 1$. Hence every integral curve intersects the
axis $t = 0$ in a point $(0, z)$ such that $\alpha < z < 1$ (see Figure 13). It follows
that in this case $x(t) \equiv 1$ is the unique integral curve going through the point
$(0, 1)$.
On the other hand, suppose

$$\int_{x_0}^1 \frac{dx}{f(x)} > -\infty. \tag{16}$$

Then for sufficiently large $t_0$, the integral curve (15) intersects the integral
curve $x(t) \equiv 1$, and is in fact tangent to it at the point $(\tau, 1)$ where

$$\tau = t_0 + \int_{x_0}^1 \frac{dx}{f(x)}$$

(see Figure 13). In this case, there is a whole family of integral curves $x_\tau(t)$
going through the point $(0, 1)$, where each $x_\tau(t)$ is parameterized by the
appropriate value of $\tau \ge 0$. Among these integral curves, the curve $x_0(t)$
shown in the figure, corresponding to the value $\tau = 0$, has the property of
lying below all the other integral curves, i.e.,

$$x_0(t) \le x_\tau(t), \qquad 0 \le t < \infty.$$

[Figure 13: the integral curves $x_0(t)$ and $x_\tau(t)$.]

This is explained by the fact that the solution of our differential equation is
unique in the region $0 \le x < 1$, $0 \le t < \infty$, so that the integral curves
do not intersect in this region. It is also easy to see that the integral curve
$x_0(t)$ is the limit of the integral curves $x(t, z)$ lying below it and passing
through points $(0, z)$ such that $0 \le z < 1$. In other words,⁶

$$x_0(t) = \lim_{z \to 1} x(t, z). \tag{17}$$

5 This is always the case if $f'(1) < \infty$ (why?).
ey 
The above analysis of the differential equation (9) has some interesting
implications for the corresponding branching process $\xi(t)$. In general, there
is a positive probability that no particles at all are present at a given time $t$.
Naturally, this cannot happen if $\lambda_0 = 0$, since then particles can only be
“created” but not “annihilated.” Clearly, the probability of all particles
having disappeared after time $t$ is

$$p_0(t) = F(t, 0)$$

if there is only one particle originally present at time $t = 0$, and

$$p_{k0}(t) = [F(t, 0)]^k = [p_0(t)]^k$$

if there are $k$ particles at time $t = 0$. The function $p_0(t)$ is the solution of the
differential equation (9) corresponding to the parameter $z = 0$:

$$\frac{dp_0(t)}{dt} = f[p_0(t)], \qquad p_0(0) = 0.$$

As already shown, this solution asymptotically approaches some value $p_0 = \alpha$
as $t \to \infty$, where $\alpha$ is the smaller root of the equation $f(x) = 0$ [recall (13)].
Thus $p_0 = \alpha$ is the extinction probability of the branching process $\xi(t)$, i.e.,
the probability that all the particles will eventually disappear. If the function
$f(x)$ is positive in the whole interval $0 \le x < 1$, the extinction probability
equals 1.
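The convergence of $p_0(t)$ to the smaller root of $f(x) = 0$ can be observed numerically. The sketch below integrates $dp_0/dt = f(p_0)$, $p_0(0) = 0$ for two hypothetical quadratic choices of $f$ (the densities $\lambda_0, \lambda_1, \lambda_2$ are assumed values chosen to sum to zero, and the names `rk4_solve` and `make_f` are illustrative).

```python
def rk4_solve(f, x0, t_end, steps=6000):
    """Integrate dx/dt = f(x), x(0) = x0, up to t_end with classical RK4."""
    h = t_end / steps
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + h * k1 / 2)
        k3 = f(x + h * k2 / 2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x


def make_f(lam0, lam1, lam2):
    """f(x) = lam0 + lam1 x + lam2 x^2 with lam0 + lam1 + lam2 = 0 (assumed)."""
    return lambda x: lam0 + lam1 * x + lam2 * x * x


# Case 1: f(x) = (2x - 1)(x - 1), roots 1/2 and 1.
two_roots = make_f(1.0, -3.0, 2.0)
# Case 2: f(x) = (x - 1)(x - 2), positive on the whole interval 0 <= x < 1.
positive_f = make_f(2.0, -3.0, 1.0)

p_two_roots = rk4_solve(two_roots, 0.0, 30.0)
p_positive = rk4_solve(positive_f, 0.0, 30.0)
print(p_two_roots, p_positive)
```

In the first case the computed value of $p_0(t)$ at $t = 30$ is close to the smaller root $1/2$, while in the second case it is close to 1, in line with the two statements above.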


6 Note that $x(t, z) = F(t, z)$ for $t \ge 0$, $0 \le z < 1$.

There is also the possibility of an “explosion” in which infinitely many
particles are created. The probability of an explosion occurring by time $t$ is
just

$$p_\infty(t) = 1 - P\{\xi(t) < \infty\} = 1 - \sum_{n=0}^{\infty} P\{\xi(t) = n\} = 1 - \sum_{n=0}^{\infty} p_n(t) = 1 - \lim_{z \to 1} F(t, z).$$

In the case where $x(t) \equiv 1$ is the unique integral curve of (9) passing through
the point $(0, 1)$, we clearly have

$$\lim_{z \to 1} F(t, z) = 1.$$

Therefore $p_\infty(t) = 0$ for arbitrary $t$ if (14) holds, and the probability of an
explosion ever occurring is 0. However, if (16) holds, we have (17) where
$x_0(t)$ is the limiting integral curve described above and shown in Figure 13.
In this case,

$$p_\infty(t) = 1 - x_0(t) > 0$$

and there is a positive probability of an explosion occurring.


PROBLEMS 


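1. A cosmic ray shower is initiated by a single particle entering the earth’s
atmosphere. Find the probability $p_n(t)$ of $n$ particles being present after time $t$
if the probability of each particle producing a new particle in a small time
interval $\Delta t$ is $\lambda \Delta t + o(\Delta t)$.

Hint. $\lambda_1 = -\lambda$, $\lambda_2 = \lambda$.

Ans. $p_n(t) = e^{-\lambda t}(1 - e^{-\lambda t})^{n-1}$, $n \ge 1$.


2. Solve Problem 1 if each particle has probability $\lambda \Delta t + o(\Delta t)$ of producing a
new particle and probability $\mu \Delta t + o(\Delta t)$ of being annihilated in a small time
interval $\Delta t$.

Hint. $\lambda_0 = \mu$, $\lambda_1 = -(\lambda + \mu)$, $\lambda_2 = \lambda$.

Ans. $p_0(t) = \mu\gamma$, $p_n(t) = (1 - \lambda\gamma)(1 - \mu\gamma)(\lambda\gamma)^{n-1}$ ($n \ge 1$),
where

$$\gamma = \begin{cases} \dfrac{1 - e^{(\lambda - \mu)t}}{\mu - \lambda e^{(\lambda - \mu)t}} & \text{if } \lambda \ne \mu, \\[2ex] \dfrac{t}{1 + \lambda t} & \text{if } \lambda = \mu. \end{cases}$$


3. Find the extinction probability $p_0$ of the branching process in the preceding
problem.

$$\text{Ans.} \quad p_0 = \begin{cases} \dfrac{\mu}{\lambda} & \text{if } \mu < \lambda, \\[1ex] 1 & \text{if } \mu \ge \lambda. \end{cases}$$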


Appendix 4 


PROBLEMS OF OPTIMAL CONTROL 


As in Sec. 15, consider a physical system which randomly changes its
state at the times $t = 1, 2, \ldots$, starting from some initial state at time $t = 0$.
Let $\varepsilon_1, \varepsilon_2, \ldots$ be the possible states of the system, and $\xi(t)$ the state of the
system at time $t$, so that the evolution of the system in time is described by the
consecutive transitions

$$\xi(0) \to \xi(1) \to \xi(2) \to \cdots.$$

We will assume that $\xi(t)$ is a Markov chain, whose transition probabilities
$p_{ij}$, $i, j = 1, 2, \ldots$ depend on a “control parameter” chosen step by step by
an external “operator.” More exactly, if the system is in state $\varepsilon_i$ at any time
$n$ and if $d$ is the value of the control parameter chosen by the operator, then

$$p_{ij} = p_{ij}(d)$$

is the probability of the system going into the state $\varepsilon_j$ at the next step. The
set of all possible values of the control parameter $d$ will be denoted by $D$.
We now pose the problem of controlling this “guided random process”
by bringing the system into a definite state, or more generally into one of a
given set of states $E$, after a given number of steps $n$. Since the evolution of
the process $\xi(t)$ depends not only on the control exerted by the operator, but
also on chance, there is usually only a definite probability $P$ of bringing the
system into one of the states of the set $E$, where $P$ depends on the “control
program” adopted by the operator. We will assume that every such control
program consists in specifying in advance, for all $\varepsilon_i$ and $t = 0, \ldots, n - 1$,
the parameter

$$d = d(\varepsilon_i, t)$$

to be chosen if the system is in the state $\varepsilon_i$ at the time $t$. In other words, the
whole control program is described by a decision rule, i.e., a function of two
variables

$$d = d(x, t),$$

where $x$ ranges over the states $\varepsilon_1, \varepsilon_2, \ldots$ and $t$ over the times $0, \ldots, n - 1$.
Thus the probability of the system going into the state $\varepsilon_j$ at time $k + 1$,
given that it is in the state $\varepsilon_i$ at time $k$, is given by

$$p_{ij} = p_{ij}(d), \qquad d = d(\varepsilon_i, k).$$

By the same token, the probability of the system being guided into one of the
states in $E$ depends on the choice of the control program, i.e., on the decision
rule $d = d(x, t)$, so that

$$P = P(d).$$


Control with a decision rule $d^0 = d^0(x, t)$ will be called optimal if

$$P(d^0) = \max_d P(d),$$

where the maximum is taken with respect to all possible control programs,
i.e., all possible decision rules $d = d(x, t)$. Our problem will be to find this
optimal decision rule $d^0$, thereby maximizing the probability

$$P(d) = P\{\xi(n) \in E\}$$

of the system ending up in one of the states of $E$ after $n$ steps.
We now describe a multistage procedure for finding $d^0$. Let

$$P(k, i, d) = P\{\xi(n) \in E \mid \xi(k) = \varepsilon_i\}$$

be the probability that after occupying the state $\varepsilon_i$ at the $k$th step, the system
will end up in one of the states of the set $E$ after the remaining $n - k$ steps (it
is assumed that some original choice of the decision rule $d = d(x, t)$ has
been made). Then clearly

$$P(k, i, d) = \sum_j p_{ij}(d) P(k + 1, j, d). \tag{1}$$

This is a simple consequence of the total probability formula, since at the
$(k + 1)$st step the system goes into the state $\varepsilon_j$ with probability $p_{ij}(d)$,
$d = d(\varepsilon_i, k)$, whence with probability $P(k + 1, j, d)$ it moves on ($n - k - 1$
steps later) to one of the states in the set $E$.

For $k = n - 1$, formula (1) involves the probability

$$P(n, j, d) = \begin{cases} 1 & \text{if } \varepsilon_j \in E, \\ 0 & \text{otherwise}, \end{cases} \tag{2}$$

and hence

$$P(n - 1, i, d) = \sum_{j:\, \varepsilon_j \in E} p_{ij}(d), \tag{3}$$

where the summation is over all $j$ such that the state $\varepsilon_j$ belongs to the given
set $E$. Obviously, $P(n - 1, i, d)$ does not depend on values of the control
parameter other than the values $d(\varepsilon_i, n - 1)$ chosen at the time $n - 1$.
Letting $d^0$ denote the value of the control parameter at which the function
(3) takes its maximum,¹ we have

$$P^0(n - 1, i) = P(n - 1, i, d^0) = \max_{d \in D} P(n - 1, i, d). \tag{4}$$

Clearly, there is a value $d^0 = d^0(\varepsilon_i, n - 1)$ corresponding to every pair
$(\varepsilon_i, n - 1)$, $i = 1, 2, \ldots$
For $k = n - 2$, formula (1) becomes

$$P(n - 2, i, d) = \sum_j p_{ij}(d) P(n - 1, j, d).$$

Here the probabilities $p_{ij}(d)$ depend only on the values $d = d(\varepsilon_i, n - 2)$ of
the decision rule $d = d(x, t)$ chosen at time $n - 2$, while the probabilities
$P(n - 1, j, d)$ depend only on the values $d = d(\varepsilon_j, n - 1)$ chosen at time
$n - 1$. Suppose we “correct” the decision rule $d = d(x, t)$ by replacing the
original values $d(\varepsilon_j, n - 1)$ by the values $d^0(\varepsilon_j, n - 1)$ just found. Then the
corresponding probabilities $P(n - 1, j, d)$ increase to their maximum values
$P^0(n - 1, j)$, thereby increasing the probability $P(n - 2, i, d)$ to the value

$$P(n - 2, i, d) = \sum_j p_{ij}(d) P^0(n - 1, j). \tag{5}$$

Clearly, (5) depends on the decision rule $d = d(x, t)$ only through the
dependence of the transition probabilities $p_{ij}(d)$ on the values $d = d(\varepsilon_i, n - 2)$
of the control parameter at time $n - 2$. Again letting $d^0$ denote the value of
the control parameter at which the function (5) takes its maximum, we have

$$P^0(n - 2, i) = P(n - 2, i, d^0) = \max_{d \in D} P(n - 2, i, d).$$


As before, there is a value $d^0 = d^0(\varepsilon_i, n - 2)$ corresponding to every pair
$(\varepsilon_i, n - 2)$, $i = 1, 2, \ldots$ Suppose we “correct” the decision rule $d(x, t)$ by
setting

$$d(x, t) = d^0(x, t) \tag{6}$$

for $t = n - 2, n - 1$ and all $x = \varepsilon_1, \varepsilon_2, \ldots$ Then clearly the probabilities
$P(k, i, d)$ take their maximum values $P^0(k, i)$ for $i = 1, 2, \ldots$ and $k = n - 2$,
$n - 1$. Correspondingly, formula (1) becomes

$$P(n - 3, i, d) = \sum_j p_{ij}(d) P(n - 2, j, d) = \sum_j p_{ij}(d) P^0(n - 2, j),$$

and this function of the control parameter $d$ takes its maximum for some
$d^0 = d^0(\varepsilon_i, n - 3)$. We can then, once again, “correct” the decision rule
$d = d(x, t)$ by requiring (6) to hold for $t = n - 3$ and all $x = \varepsilon_1, \varepsilon_2, \ldots$, as
well as for $t = n - 2, n - 1$ and all $x = \varepsilon_1, \varepsilon_2, \ldots$

1 It will be assumed that this maximum and the others considered below exist.

Continuing this step-by-step procedure, after $n - 1$ steps we eventually
get the optimal decision rule $d^0 = d^0(x, t)$, defined for $t = 0, \ldots, n - 1$ and
all $x = \varepsilon_1, \varepsilon_2, \ldots$, such that the probability $P(d) = P(0, i, d)$ satisfying the
initial condition $\xi(0) = \varepsilon_i$ achieves its maximum value. At the $(n - k)$th
step of this procedure of “successive corrections,” we find the value
$d^0 = d^0(\varepsilon_i, k)$ maximizing the function

$$P(k, i, d) = \sum_j p_{ij}(d) P^0(k + 1, j),$$

where $P^0(k + 1, j)$ is the maximum value of the probability $P(k + 1, j, d)$.
Carrying out this maximization, we get Bellman’s equation²

$$P^0(k, i) = \max_{d \in D} \sum_j p_{ij}(d) P^0(k + 1, j),$$

which summarizes the whole procedure just described.
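The procedure of successive corrections can be sketched in Python for a chain with finitely many states and finitely many values of the control parameter. The transition probabilities below are hypothetical numbers chosen only for illustration, states are indexed from 0, and the names `backward_induction`, `best` and `rule` are not from the text.

```python
from fractions import Fraction as F


def backward_induction(p, target, n):
    """Solve Bellman's equation backward in time.

    p[d][i][j] is the probability of going from state i to state j when
    the control value d is used; target is the set E of state indices;
    n is the number of steps.  Returns (best, rule), where best[k][i]
    plays the role of P^0(k, i) and rule[k][i] that of d^0(state i, k).
    """
    m = len(next(iter(p.values())))
    best = [[F(0)] * m for _ in range(n + 1)]
    rule = [[None] * m for _ in range(n)]
    best[n] = [F(1) if j in target else F(0) for j in range(m)]
    for k in range(n - 1, -1, -1):
        for i in range(m):
            # Expected continuation value of each control value d.
            d0 = max(p, key=lambda d: sum(p[d][i][j] * best[k + 1][j]
                                          for j in range(m)))
            rule[k][i] = d0
            best[k][i] = sum(p[d0][i][j] * best[k + 1][j] for j in range(m))
    return best, rule


# Hypothetical two-state chain with control values 0 and 1.
p = {
    0: [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]],
    1: [[F(1, 5), F(4, 5)], [F(1, 2), F(1, 2)]],
}
best, rule = backward_induction(p, target={1}, n=2)
print(best[0], rule)
```

In this example the optimal control value in a given state changes from one step to the next, which is why the decision rule has to be a function of both the state and the time.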


Example 1. Suppose there are just two states $\varepsilon_1$ and $\varepsilon_2$, and suppose
the transition probabilities are continuous functions of the control parameter
in the intervals

$$\alpha_1 \le p_{11}(d) \le \beta_1, \qquad \alpha_2 \le p_{21}(d) \le \beta_2.$$

What is the optimal decision rule maximizing the probability of the system,
initially in the state $\varepsilon_1$, going into the state $\varepsilon_1$ two steps later?

Solution. In this case,

$$P^0(1, 1) = \beta_1, \qquad P^0(1, 2) = \beta_2,$$

$$P^0(0, 1) = \max_{d \in D} [p_{11}(d)\beta_1 + p_{12}(d)\beta_2] = \max_{d \in D} [p_{11}(d)(\beta_1 - \beta_2) + \beta_2].$$

If the system is initially in the state $\varepsilon_1$, then clearly we should maximize the
transition probability $p_{11}$ (by choosing $p_{11} = \beta_1$) if $\beta_1 > \beta_2$, while maximizing
the transition probability $p_{12} = 1 - p_{11}$ (by choosing $p_{11} = \alpha_1$) if $\beta_1 < \beta_2$.³
There is an analogous optimal decision rule for the case where the initial
state of the system is $\varepsilon_2$.


2 In keeping with (2)-(4), we have

$$P^0(n, j) = \begin{cases} 1 & \text{if } \varepsilon_j \in E, \\ 0 & \text{otherwise}. \end{cases}$$

3 Clearly, any choice of $p_{11}$ in the interval $\alpha_1 \le p_{11} \le \beta_1$ is optimal if $\beta_1 = \beta_2$.


Example 2 (The optimal choice problem). Once again we consider the
optimal choice problem studied on pp. 28-29 and 86-87, corresponding to
a Markov process $\xi(t)$ with transition probabilities

$$p_{ij} = \begin{cases} 0 & \text{if } i \ge j, \\[1ex] \dfrac{i}{(j - 1)j} & \text{if } i < j \le m, \\[2ex] \dfrac{i}{m} & \text{if } j = m + 1, \end{cases} \tag{7}$$

where, as on p. 28, choice of an object better than all those previously
inspected causes the process $\xi(0) \to \xi(1) \to \xi(2) \to \cdots$ to terminate. In
each of the states $\varepsilon_1, \ldots, \varepsilon_m$ (whose meaning is explained on p. 86), the
observer decides whether to terminate or to continue the process of inspection.
The decision to terminate, if taken in the state $\varepsilon_i$, is described formally by
the transition probabilities

$$p_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j, \end{cases} \tag{8}$$

while the decision to continue corresponds to the transition probabilities (7).
Hence we are dealing with a “guided Markov process,” whose transition
probabilities $p_{ij}$ depend on the observer’s decision. Here the control
parameter $d$ takes only two values, 0 and 1 say, where 0 corresponds to stopping
the process and 1 to continuing it. Thus (8) gives the probabilities $p_{ij}(0)$ and
(7) the probabilities $p_{ij}(1)$.

Every inspection plan is described by a decision rule $d = d(x)$,
$x = \varepsilon_1, \ldots, \varepsilon_m$, which specifies in advance for each of the states $\varepsilon_1, \ldots, \varepsilon_m$
whether inspection should be continued or terminated by selecting the last
inspected object. The problem consists of finding an inspection plan, or
equivalently a decision rule $d = d(x)$, $x = \varepsilon_1, \ldots, \varepsilon_m$, maximizing the
probability of selecting the very best of all $m$ objects. This probability is just

$$P = \sum_i \frac{i}{m}\, p_i, \tag{9}$$

where $i/m$ is the probability that the $i$th inspected object is the best (recall p.
29), $p_i$ is the probability that the process will stop in the state $\varepsilon_i$, and the
summation is over all the states $\varepsilon_i$ in which the decision rule $d = d(x)$ calls for
the process to stop.

To find the optimal decision rule $d^0 = d^0(x)$ maximizing (9), we consider
the probability $P(k, d)$ of selecting the best object, given that the number of
previously inspected objects is no less than $k$, i.e., given that the process
$\xi(t)$ actually occupies the state $\varepsilon_k$. By the total probability formula, we have

$$P(k, d) = \sum_j p_{kj}(d) P(j, d). \tag{10}$$

Clearly, if the process occupies the state $\varepsilon_m$, then the $m$th object is the best of
the first $m$ objects inspected and hence automatically the best of all
objects. Therefore the optimal value of the decision rule $d = d(x)$ for
$x = \varepsilon_m$ is just $d^0(\varepsilon_m) = 0$, and $P(m, d) = 1$ for this value. It follows from
(9) and (10) that

$$P(m - 1, d) = \begin{cases} \dfrac{m - 1}{m} & \text{if } d(\varepsilon_{m-1}) = 0, \\[2ex] \dfrac{m - 1}{(m - 1)m} & \text{if } d(\varepsilon_{m-1}) = 1 \end{cases} \tag{11}$$

is the probability of choosing the best object, given that the process stops in
the state $\varepsilon_m$ and the number of previously inspected objects is no less than
$m - 1$. Moreover, (11) implies that the optimal value of the decision rule
$d = d(x)$ for $x = \varepsilon_{m-1}$ is $d^0(\varepsilon_{m-1}) = 0$, and that

$$P(m - 1, d) = \frac{m - 1}{m}$$

for this value.


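Now suppose the optimum values of the decision rule $d = d(x)$ are all
zero for $x = \varepsilon_k, \ldots, \varepsilon_m$, corresponding to the fact that the process is
terminated in any of the states $\varepsilon_k, \ldots, \varepsilon_m$. Then what is the optimal value $d^0(\varepsilon_{k-1})$?
To answer this question, we note that (9) and (10) imply that

$$P(k - 1, d) = \begin{cases} \dfrac{k - 1}{m} & \text{if } d(\varepsilon_{k-1}) = 0, \\[2ex] \dfrac{k - 1}{(k - 1)k} \cdot \dfrac{k}{m} + \dfrac{k - 1}{k(k + 1)} \cdot \dfrac{k + 1}{m} + \cdots + \dfrac{k - 1}{(m - 1)m} \cdot \dfrac{m}{m} & \text{if } d(\varepsilon_{k-1}) = 1 \end{cases}$$

is the probability of choosing the best object, given that the process stops
in the states $\varepsilon_k, \ldots, \varepsilon_m$ and the number of previously inspected objects is
no less than $k - 1$. It follows that the optimal value of the decision rule
$d = d(x)$ for $x = \varepsilon_{k-1}$ is

$$d^0(\varepsilon_{k-1}) = \begin{cases} 0 & \text{if } \dfrac{1}{k - 1} + \dfrac{1}{k} + \cdots + \dfrac{1}{m - 1} \le 1, \\[2ex] 1 & \text{otherwise}. \end{cases} \tag{12}$$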


Moreover, it is easy to see that the optimal decision rule $d^0 = d^0(x)$ has the
structure

$$d^0(x) = \begin{cases} 0 & \text{if } x = \varepsilon_{m_0 + 1}, \ldots, \varepsilon_m, \\ 1 & \text{if } x = \varepsilon_1, \ldots, \varepsilon_{m_0}, \end{cases}$$

where $m_0$ is some integer. Thus the optimal selection procedure consists
in continuing inspection until the appearance of an object numbered $k > m_0$
which is better than all previously inspected objects. According to (12),
$m_0$ is the largest positive integer such that

$$\frac{1}{m_0} + \frac{1}{m_0 + 1} + \cdots + \frac{1}{m - 1} > 1. \tag{13}$$
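A threshold rule of this kind can be checked numerically for a small number of objects. The sketch below assumes the reading in which the first $s$ objects are passed over and the first later object better than all its predecessors is selected; for such a rule the probability of selecting the best object is the standard expression $\frac{s}{m}\left(\frac{1}{s} + \cdots + \frac{1}{m - 1}\right)$. The value $m = 10$ and the function names are illustrative assumptions.

```python
from fractions import Fraction


def harmonic_tail(s, m):
    """The sum 1/s + 1/(s + 1) + ... + 1/(m - 1)."""
    return sum(Fraction(1, j) for j in range(s, m))


def threshold(m):
    """Largest s >= 1 for which the harmonic tail exceeds 1 (None if absent)."""
    candidates = [s for s in range(1, m) if harmonic_tail(s, m) > 1]
    return max(candidates) if candidates else None


def success_probability(s, m):
    """Chance of picking the best of m objects when the first s are skipped."""
    if s == 0:
        return Fraction(1, m)  # the first object is simply accepted
    return Fraction(s, m) * harmonic_tail(s, m)


m = 10
s_star = threshold(m)
best_skip = max(range(m), key=lambda s: success_probability(s, m))
print(s_star, best_skip, float(success_probability(s_star, m)))
```

For $m = 10$ the harmonic-sum criterion and the direct maximization over $s$ single out the same threshold, and the resulting success probability is a little under 0.4.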
PROBLEMS 
1. In Example 2, prove that

$$m_0 \approx \frac{m}{e} \tag{14}$$

if $m$ is large, where $e = 2.718\ldots$ is the base of the natural logarithms.

Hint. Use an integral to estimate the left-hand side of (13).


2. Find the exact value of $m_0$ for $m = 50$. Compare the result with (14).


3. Consider a Markov chain with two states $\varepsilon_1$ and $\varepsilon_2$ and transition
probabilities $p_{ij}(d)$ depending on a control parameter $d$ taking only two values 0 and 1.
Suppose

Pu) =, pu) =#%, pul) =%, pull) =%. 


What is the optimal decision rule maximizing the probability of the system 
initially in the state ¢,, going into the state ¢, three steps later? What is this 
maximum probability?